Hello and welcome to another GTC 2022 breakout session, where we're going to be talking about solutions from NetApp and NVIDIA that enable customers to advance their machine learning and deep learning initiatives at any size, from the smallest development efforts to world-class supercomputing capabilities. My name is David Arnett, and I'm a principal technical marketing engineer with the NetApp solutions team. Helping me with this presentation from NVIDIA is Jeff Weiss, director of solution architecture and engineering for DGX systems. Over the next 20 minutes or so, I'm going to talk about some of our existing solutions for AI infrastructure, as well as some new things we have coming soon. I'm going to cover the ONTAP AI reference architecture, which has been very successful as a starting point for customers looking to deploy a powerful but operationally simple do-it-yourself AI infrastructure. We've taken that a couple of steps further with the ONTAP AI integrated solution, which offers a single-SKU procurement and support model for ONTAP AI, and with the NVIDIA DGX Foundry solution, which includes everything an enterprise or organization needs to jump into AI model development and training on a monthly rental basis. In addition to ONTAP AI, I'm going to talk about the NetApp and BeeGFS solution for HPC and big-cluster machine learning and deep learning workloads, and about how NetApp will be supporting the NVIDIA AI Enterprise solution, which brings the power of NVIDIA GPUs to enterprise operations based on VMware. Before I dive into more technical details on the solutions, I want to touch on why they're important. NetApp has been working with customers in this space for several years now, and there are a few important things we've learned from that.
The infrastructure required to do this kind of work is pretty specialized, at a time when customers are closing down data centers and building less infrastructure, making the availability of resources one of the biggest challenges data scientists face when trying to create new capabilities. Cloud capabilities are expanding and are a great place for initial experimentation. But because of the sometimes sensitive nature of the data involved, or because cloud operations can become very expensive very quickly, many customers are finding they also need on-premises infrastructure, which can create even more complexity that slows down AI initiatives. The other big challenge we've seen is getting projects from the proof-of-concept stage into production. It's relatively easy for data scientists and data engineers to collate a data set for initial experimentation and testing, but it's a completely different story to create a consistent, automated data pipeline that allows regular retraining of production models or streamlined development of new use cases. NetApp has a portfolio of storage platforms and software tools that position data where it's needed, when it's needed, to eliminate the challenge when data is coming from numerous sources or locations. With many customers already supporting large deployments of VMware virtual infrastructure, it's a natural fit to integrate GPUs into VMware clusters to help drive even more mainstream adoption of machine learning and deep learning capabilities. Jeff is going to dive into the NVIDIA AI Enterprise suite in a few minutes, but since NetApp has been supporting enterprise deployments of VMware for many years, we can offer all of the same data management capabilities to virtual environments that we do to bare-metal environments. So let's take a deeper look at these solutions.
Let's start with the infrastructure customers are using to tackle the largest machine learning, deep learning, and HPC challenges. In order to support massive compute clusters operating as a single unit, a storage subsystem needs to be able to spread the workload across a large number of storage nodes. Workloads like traditional HPC simulation, as well as machine learning challenges like natural language processing, can require that a single file be read by the entire cluster concurrently, and they generally require a parallel file system to ensure that storage doesn't become a bottleneck and limit compute performance. The NetApp E-Series team has been working with the BeeGFS parallel file system for several years now, and we're now working with the maintainer of BeeGFS, ThinkParQ, to sell and support it through standard NetApp channels. That means we can offer a complete software and storage hardware solution for the largest and most demanding deployments. To support large-scale clusters, we've created a building block that can be used individually or combined with other blocks to scale capacity and performance as needed. Each building block consists of two NetApp EF600 storage systems and two x86 servers running the BeeGFS storage services. Both servers are directly connected to both EF600s with redundant connections, and the servers are then connected into the upstream storage network for client access. That building block is the basis for much larger configurations, up to many exabytes in scale. And since the file system can scale both metadata and storage service operations, it's simple to expand the cluster by adding building blocks as needed. I mentioned the redundant connections that make each building block a high-availability unit.
And BeeGFS can mirror data across storage services and across building blocks if desired, to deliver a file system with massive capacity and performance and the resiliency required for computing at this scale, because nobody wants to be two weeks into a three-week-long job when the file system goes offline and the job crashes. As part of NetApp's commitment to BeeGFS, we've developed Ansible automation to deploy and manage it at scale, and we've developed a Kubernetes CSI driver to enable automatic provisioning with Kubernetes, which is quickly becoming a key component of the environment for data science operations at scale. I'll touch on that in a few more slides when I talk about our DataOps Toolkit. We are currently in the process of testing this configuration for NVIDIA DGX SuperPOD certification, where we'll be testing three of these building blocks with 20 DGX A100 systems as the standard scalable unit; that can be combined with up to six other scalable units to create a full SuperPOD with 140 DGX A100 systems. So I talked about the building block hardware just now, and there are a couple of ways those blocks can be used. Both metadata and storage services can be run on a block, which delivers 22 GB per second of write bandwidth, or the block can be used for storage services only to deliver up to 74 GB per second of read bandwidth. We've validated this using Lenovo x86 servers, but customers are free to use whichever server vendor they prefer, and we'll still support the BeeGFS software and NetApp hardware as a single solution. In addition to the flexibility of the building block itself, BeeGFS offers industry-leading burst buffer capability with BeeOND on-demand file systems to provide the SLA levels demanded by enterprise customers.
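The scaling story above is easy to sketch in a few lines of code. This is a rough sizing helper only: the per-block figures (74 GB/s reads in storage-services-only mode, 22 GB/s writes) come straight from the talk, and the assumption that bandwidth scales linearly as blocks are added is an idealization for back-of-the-envelope planning, not a guarantee.

```python
import math

# Per-building-block figures quoted in the session (EF600 pair + two x86
# servers running BeeGFS). Linear scaling across blocks is an assumption.
PER_BLOCK_READ_GBPS = 74   # storage-services-only mode
PER_BLOCK_WRITE_GBPS = 22

def estimate_throughput(num_blocks: int) -> dict:
    """Rough aggregate read/write bandwidth estimate in GB/s."""
    return {
        "read_gbps": num_blocks * PER_BLOCK_READ_GBPS,
        "write_gbps": num_blocks * PER_BLOCK_WRITE_GBPS,
    }

def blocks_for_read_target(target_gbps: float) -> int:
    """Smallest number of building blocks whose aggregate read
    bandwidth meets the target, assuming linear scaling."""
    return math.ceil(target_gbps / PER_BLOCK_READ_GBPS)
```

For example, the three-block scalable unit being tested for SuperPOD certification would pencil out to roughly 222 GB/s of aggregate reads under this linear-scaling assumption.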
NetApp is also driving new innovation with BeeGFS by creating the BeeGFS CSI driver for Kubernetes integration and by contributing code for GPUDirect Storage support, which will be generally available in the March 2022 release of BeeGFS. Finally, we've integrated this solution into our support services capabilities, so NetApp can cover level-one and level-two support, with level-three escalation directly to ThinkParQ engineers. Now I'm going to switch gears to the ONTAP AI reference architecture, which offers a different set of capabilities for developing enterprise machine learning and deep learning data pipelines. The ONTAP AI reference architecture was our first AI infrastructure solution and has been updated a couple of times now as the server systems and network options have changed. From a use case perspective, this architecture is optimized for a large number of smaller concurrent jobs rather than single large-cluster workloads. In many cases, the enterprise lines of business are experimenting with machine learning and deep learning in ad hoc ways, but we're talking to a lot of enterprise IT departments that are creating an AI center of excellence to offer these capabilities as a central IT function, with shared GPU and data management infrastructure. ONTAP AI easily supports many concurrent training and inferencing jobs and integrates directly into Kubernetes-based workload orchestrators for very flexible storage management. This architecture is also based on DGX A100 systems, and we validated up to eight A100 systems per HA pair, as shown here. ONTAP is primarily an Ethernet-based storage system, and this architecture uses Mellanox 100 or 200 Gb Ethernet switches for the storage network. The AFF A800 storage system has up to 48 internal NVMe drives of up to 30 terabytes each and is good for about 25 GB per second of read throughput per HA pair.
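To make the Kubernetes integration concrete, here is a minimal sketch of how a training job might request BeeGFS storage through a CSI driver: a standard PersistentVolumeClaim, built as a plain Python dict so it can be dumped to YAML or submitted with a Kubernetes client. The storage class name `beegfs-scratch` is a hypothetical example, not something from the talk; use whatever class your BeeGFS CSI deployment actually defines.

```python
def beegfs_pvc(name: str, size: str,
               storage_class: str = "beegfs-scratch") -> dict:
    """Sketch of a PVC a BeeGFS CSI driver could satisfy.

    The storage class name is illustrative. BeeGFS is a shared parallel
    file system, so ReadWriteMany lets every worker pod in a distributed
    training job mount the same volume concurrently.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }
```

The same manifest pattern is what higher-level automation (Ansible playbooks, job launchers) would generate on a user's behalf.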
And the controllers can be clustered and scale out to deliver up to 300 gigabytes per second. This architecture has been tested with both Ethernet and InfiniBand compute fabrics, and the performance is comparable, so customers are free to choose whichever network topology they prefer. I'll talk about some of the integrated data management features in the DataOps Toolkit slide, but with native replication connectivity to edge and cloud-based ONTAP systems, as well as object storage, ONTAP enables comprehensive, automated data pipelines for a wide range of machine learning and deep learning workflows. So we've had a lot of success with ONTAP AI, but there are still some challenges with building something like this that many customers don't want to deal with. To make procurement and support for ONTAP AI even easier, NetApp and NVIDIA have worked with our joint distribution partner Arrow to offer the ONTAP AI integrated solution, which is another way of saying the whole stack is available as a single package and supported through a single support line that starts at NVIDIA and is backed by NetApp whenever necessary. This program offers configurations with one to eight DGX A100 systems and either A400, A700, or A800 storage systems. That allows customers to build a somewhat custom configuration but guarantees it will perform as expected, because we've already done the engineering validation. Arrow performs all the delivery, installation, and services, ensuring that the system is deployed correctly every time. And with a single number to call for support, it's easy to get help with even complex problems, as NVIDIA handles the DGX hardware, software, and networking and can engage NetApp support for assistance if issues are storage-related. We can even offer an MLOps platform from Domino Data Lab to create a turnkey solution, so data scientists can start creating results in days instead of weeks or months.
The last ONTAP AI solution I'm going to cover is for customers who don't want to own the infrastructure at all, but would rather rent it on a monthly basis. To make life even easier in the age of cloud-first operational initiatives, NVIDIA DGX Foundry is a subscription-based service that combines dedicated DGX A100 servers and NetApp storage with NVIDIA's internally developed MLOps software, called Base Command Platform. NVIDIA DGX Foundry enables customers to rent a world-class supercomputer when they need it, without purchasing it. Offering training performance roughly double what can be realized on the public cloud, DGX Foundry streamlines model development and training using NGC pre-trained models and software containers. End users simply log in, upload data, and start experimenting. With robust job management and monitoring tools, Base Command Platform ensures users are getting the most out of the infrastructure to maximize productivity and return on investment. This service, like I said, is based on the ONTAP AI infrastructure I showed a minute ago, and it's hosted in Equinix data centers, currently on both the east coast and west coast of the United States, with other locations around the world on the roadmap. You can even take this environment for a test drive using the NVIDIA LaunchPad service that Jeff is going to touch on in just a minute. But before I go there, I want to talk about one more thing: how we've integrated the storage management features of these systems into the tools and frameworks that data engineers and data scientists are using. Machine learning and deep learning as an application space was born in the cloud-native era, and most data scientists and data engineers are already working with containers in one way or another. Containers provide a great way to encapsulate code and software configurations, which makes sharing and collaborating very easy.
But they can complicate data access and governance paradigms, as users may run multiple concurrent experiments with slightly different data sets or hyperparameter configurations. The DataOps Toolkit is intended to streamline workflows for data scientists and engineers, while also allowing IT administrators to retain control of gold copies of data, maintain version control of both data and models under development, and even provide standard backup and recovery capabilities for these now-critical IP assets. The toolkit is an importable library of Python functions that works as a standalone tool for shops that are using containerized tools like Jupyter notebooks but not running a full orchestration layer like Kubernetes. For organizations that do use Kubernetes for workload orchestration, the toolkit supports an even wider variety of data sources through integration with the NetApp Astra Trident Kubernetes storage plugin. From a capability perspective, the toolkit allows data scientists and engineers to perform a number of tasks without involving storage administrators. Users can provision new volumes for new projects or experiments; instantly clone existing data volumes or workspaces without consuming additional storage capacity; snapshot data and workspace volumes for version control or traceability; and even trigger updates from a number of other data fabric sources. These functions can also be integrated into automated workflows with Kubeflow or Apache Airflow and into CI/CD development environments to help create streamlined pipelines for model retraining or experimentation with the latest data. So I've covered our solutions for the larger use cases, but many customers are looking for more cost-effective ways to start their AI journey. At this point, I'm going to ask Jeff to explain the NVIDIA AI Enterprise suite for mainstream VMware environments. Hello, this is Jeff Weiss from NVIDIA.
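The "clone a workspace per experiment" pattern described above can be sketched at the Kubernetes layer. In standard CSI terms, a clone is expressed as a new PersistentVolumeClaim whose `dataSource` references the source claim; a CSI driver such as Astra Trident can fulfill that as a space-efficient clone rather than a full copy. This is a sketch of the mechanism, not the DataOps Toolkit's actual API; the storage class name `ontap-flexvol` and the naming convention are illustrative assumptions.

```python
def clone_pvc(source: str, experiment_id: str, size: str,
              storage_class: str = "ontap-flexvol") -> dict:
    """Sketch: clone an existing PVC for a single experiment.

    The dataSource field is what turns this claim into a clone of an
    existing PVC instead of a fresh empty volume. Names and the storage
    class are hypothetical examples.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{source}-exp-{experiment_id}"},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
            # Reference the gold-copy PVC; the CSI driver creates a
            # space-efficient clone, so the admin's source stays intact.
            "dataSource": {"kind": "PersistentVolumeClaim",
                           "name": source},
        },
    }
```

Because the gold copy is never written to, each experiment gets an isolated, writable view of the data while the IT team keeps version control of the original.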
Thank you for listening to this overview of the NVIDIA AI Enterprise offering. Here's a great graphic that highlights the details of our joint solution. From the bottom up, we highlight its components, starting with NVIDIA-Certified servers. These servers, which are provided by our OEM partners, utilize our industry-leading GPUs and DPUs. NVIDIA-Certified server configurations have been tested specifically with the NVIDIA AI Enterprise workloads, taking the guesswork out of which server configurations our customers need to select for their workloads. Moving up the stack, we see how the joint solution integrates with the VMware software stack for managing workloads on demand, with tools like vCenter and Tanzu to simplify the deployment of vGPU workloads, and features like DRS initial placement, vMotion, and TKG pod management. Next is the NVIDIA AI Enterprise suite of AI software. This runs on top of vSphere 7 Update 2 on NVIDIA-Certified servers. It includes key enabling technologies and software from NVIDIA for rapid deployment, management, and scaling of AI workloads in the virtualized data center. The NVIDIA AI Enterprise suite enables IT administrators, data scientists, and AI researchers to quickly run NVIDIA AI applications and libraries optimized for GPU acceleration, reducing deployment time and ensuring reliable performance, all while leveraging the tools they already use in production today. The NVIDIA AI Enterprise software suite is optimized to run on vSphere and streamlines deployment and management of AI, machine learning, and data analytics workloads. NVIDIA AI Enterprise includes AI and data analytics tools and frameworks, GPU virtualization software, and networking and infrastructure management software for optimized performance.
It offers a complete AI software offering for the enterprise, including NVIDIA AI and data science applications and toolkits, NVIDIA GPU and network operators, NVIDIA CUDA-X, DOCA, and Magnum IO, as well as virtual GPU software and a private registry. NVIDIA AI Enterprise is certified for enterprise data centers running vSphere on NVIDIA-Certified systems through our OEM partners. It includes support, updates, and maintenance: phone, portal, and email support for issue resolution, plus software maintenance and new releases. Customers can open tickets for support of scaling and performance issues, and we have an ongoing NVIDIA system certification process as well as integrated partner support. The key benefits of moving to vGPU workloads on vSphere are bare-metal performance, meaning performance virtually indistinguishable from bare-metal environments; the ability to leverage data center management and monitoring best practices and tools, which allows enterprises to provide GPU compute resources on demand for their end users; optimal resource utilization, by provisioning GPU resources for fractional or multi-GPU workloads; and finally, improved business continuity that enables responsiveness to changing business requirements and remote teams. Okay, so are you ready to fast-track your AI journey? NVIDIA LaunchPad is a global program that provides immediate, free remote access for AI practitioners, data scientists, and IT administrators to get hands-on experience with AI development and infrastructure optimization. The curated labs available on NVIDIA LaunchPad for NVIDIA AI Enterprise are purposely designed to simulate the most common use cases for AI, so you can now instantly conduct an evaluation of NVIDIA AI Enterprise without having to set up your own environment. If you're interested in learning more, go to nvidia.com/try-ai to apply and get started. Now I'm going to hand it back to Dave to summarize the joint NVIDIA and NetApp AI offerings and next steps.
Before I wrap up, I'd like to mention that we'll be validating the NVIDIA AI Enterprise solution on platforms like FlexPod and with other server vendors in the coming months. And if you're using a supported NetApp storage system with the underlying VMware infrastructure, the DataOps Toolkit I talked about will work just as well in a virtual environment as it does in larger bare-metal environments. It may even offer more value, as more data scientists and engineers can leverage those capabilities to save time and storage resources. So, as a quick summary: over the last 20 minutes, I've talked about ONTAP AI solutions where you can build your own configuration using our detailed design and sizing guidance, buy a complete GPU-accelerated infrastructure as a single SKU, or rent one on a monthly basis. I also covered the configuration we'll be testing for DGX SuperPOD certification, and Jeff covered the VMware solution for enterprise AI that we'll be supporting as well. Combined with other NetApp storage solutions like StorageGRID object storage and our cloud services offerings, we've got a complete portfolio of storage solutions for hybrid machine learning and deep learning workflows. And I think everyone realizes at this point that they're all going to be hybrid in the end. I didn't go into our software tools like SnapMirror, Cloud Sync, and XCP that allow data movement between any of the endpoints I've discussed, but those help round out the capability to offer data where and when it's needed, with automated data management and optimal resource usage for any workflow or workload. If you'd like more information, you can check out the documentation for these solutions listed here. You can also try both DGX Foundry and the NVIDIA AI Enterprise suite using the LaunchPad service. And you can always reach out to ng-ai-inquiry@netapp.com for direct access to NetApp's AI solution experts. Thanks very much, and have a great day.