SC19: The Intersection of HPC and AI

Santosh Rao

November 19, 2019

SC19, the premier supercomputing conference, is happening this week (November 17-22) at the Colorado Convention Center in Denver. You can find NetApp at Booth #249. This year, NetApp is highlighting technologies to satisfy both HPC and AI needs, including important collaborations with our partners NVIDIA and ScaleMatrix. The NetApp AI team is on hand to tackle your questions, and we’ll be demonstrating the latest technologies to accelerate HPC and AI workloads.

You’ve probably noticed that many of the recent blogs in this series have focused on the latest happenings and announcements at AI conferences and other industry events. The pace of innovation this year has been extraordinary, and NetApp’s AI team has been extremely busy attending events, meeting with customers, and evangelizing the benefits of NetApp AI technologies spanning from edge to core to cloud.

This blog is coming to you—in near real time—from SC19 in Denver, Colorado, a show where NetApp has maintained a presence for years. So far at this year’s show, I’m struck by the increasing intersection between HPC and AI. In fields ranging from aircraft design to weather forecasting to seismic exploration, researchers and engineers are augmenting traditional HPC approaches with AI to produce deeper insights and faster results.

As I’ve discussed in previous blogs (see my recent blog Infrastructure Design for Autonomous Vehicle Development), your path to AI may depend on where you start and what your existing skill sets are. Our job at NetApp is to deliver solutions that help you make a more seamless transition to AI, regardless of your entry point, by removing bottlenecks and enabling data pipelines that deliver critical data where it’s needed, at the speed it’s needed.

It’s no surprise that the teams attending SC19 are largely approaching AI from an HPC perspective. Yesterday afternoon (Monday, November 18) NVIDIA CEO Jensen Huang delivered a keynote and introduced two new solutions in collaboration with NetApp and others to help HPC teams succeed with HPC and AI: NVIDIA DGX SuperPOD and NVIDIA Magnum IO.

NVIDIA SuperPOD Brings Supercomputing to the Enterprise

NetApp and NVIDIA have been collaborating closely for a number of years, combining NVIDIA expertise in GPU computing with NetApp’s proven solutions for high-performance storage and advanced data management. ONTAP AI, a market-leading joint solution combining NVIDIA DGX-1 and DGX-2 with NetApp All Flash FAS storage, was first introduced in 2018. (I described ONTAP AI in my previous blogs here and here.)

NVIDIA DGX SuperPOD is the newest addition to our joint solutions portfolio, simplifying supercomputing and enabling AI. The DGX SuperPOD is designed to support extreme HPC and AI workloads that require multi-petaflop-scale compute power, in a systemized solution that can be deployed in just weeks, making HPC-scale technology more accessible to the enterprise. This turnkey solution takes the complexity and guesswork out of HPC and delivers a complete, validated solution stack (including best-in-class compute, switches, networking, and storage) for deployments at scale. Running NVIDIA AI software on the DGX SuperPOD provides a high-performance DL training environment for large-scale, multi-user AI software development teams.

DGX SuperPOD configurations start at 32 DGX-2 nodes, each with 16 NVIDIA V100 Tensor Core GPUs. Each node delivers 2 petaFLOPS of AI compute power—the equivalent of hundreds of CPU-based servers. DGX SuperPOD reduces physical footprint and power consumption to a fraction of that of a comparable traditional compute cluster.
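As a quick back-of-the-envelope check on those figures, the entry configuration's total GPU count and aggregate peak AI compute can be worked out as follows. This is a minimal sketch: the function name is made up for illustration, and the total is a simple sum of the per-node peak figures quoted above, not a measured benchmark.

```python
def superpod_totals(nodes: int, gpus_per_node: int, petaflops_per_node: float):
    """Return (total GPUs, aggregate peak AI petaFLOPS) for a cluster.

    Simple multiplication of per-node figures; real-world throughput
    depends on the workload, the interconnect, and the storage behind it.
    """
    return nodes * gpus_per_node, nodes * petaflops_per_node


# Entry configuration from the text: 32 DGX-2 nodes, 16 GPUs and 2 petaFLOPS each.
total_gpus, total_pflops = superpod_totals(32, 16, 2.0)
print(f"{total_gpus} GPUs, {total_pflops:g} petaFLOPS peak AI compute")
```

For the 32-node starting point, that works out to 512 GPUs and 64 petaFLOPS of peak AI compute.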

NetApp complements the DGX-2 compute capabilities with high-performance NetApp EF600 all-flash NVMe storage. NetApp E-Series and EF-Series storage has been highly regarded in the HPC community for more than a decade. The EF600 delivers 2M sustained IOPS, response times under 100 microseconds, 44GBps of throughput, and 99.9999% availability. With industry-leading density, the EF600 is the only end-to-end NVMe system to support 100Gb NVMe over InfiniBand (IB), 100Gb NVMe over RoCE, and 32Gb NVMe over FC, future-proofing your DGX SuperPOD.
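To put the 99.9999% availability figure in concrete terms, it can be converted into a maximum downtime per year. The snippet below is only an arithmetic illustration; the function name is hypothetical, and it assumes a 365-day year with availability measured over the whole year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year: 31,536,000 seconds


def max_downtime_seconds(availability_percent: float) -> float:
    """Downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


downtime = max_downtime_seconds(99.9999)
print(f"Six nines allows about {downtime:.1f} seconds of downtime per year")
```

In other words, six nines of availability leaves a budget of roughly half a minute of downtime per year.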

To find out more about DGX SuperPOD, be sure to check out the following link:

https://www.youtube.com/watch?v=NoCdoBl9vPw&feature=youtu.be

NVIDIA Magnum IO Eliminates I/O Bottlenecks

NetApp and NVIDIA use GPU and data acceleration technologies to address emerging computing workloads like AI, genomics, ray tracing, analytics, databases, and seismic processing and interpretation. CPUs are increasingly a bottleneck to continued performance improvement on these workloads. RAPIDS technology, introduced earlier this year, bridges the CPU and GPU universes to accelerate the transition to GPUs.

With every performance improvement in one area, however, the bottleneck moves somewhere else. In GPU computing, the CPU typically controls data loading from storage to GPUs, creating I/O bottlenecks, especially for use cases where real-time data access is critical. NetApp and NVIDIA are working to eliminate this bottleneck and deliver further acceleration for HPC workloads.

NVIDIA Magnum IO is a multi-GPU, multi-node networking and storage I/O optimization stack. Its APIs integrate compute, networking, distributed file systems, and storage to maximize I/O performance and functionality. Magnum IO interfaces with CUDA-X™ HPC and AI libraries to accelerate I/O for a broad range of HPC, AI, data analytics, and visualization use cases.

GPUDirect Storage is a key feature of Magnum IO. It opens a direct data path between GPU memory and storage, eliminating CPU bottlenecks. Taking the CPU out of the I/O pathway increases the demands on storage. With support for RDMA, NetApp EF600 all-flash NVMe storage delivers the performance and reliability required to keep up with the I/O demands of data-hungry GPUs.
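To get a feel for what "keeping up" means, here is a rough sizing sketch that estimates how many storage arrays it would take to feed a GPU cluster at a given per-GPU ingest rate. The 44GBps per-array default is the EF600 throughput quoted earlier. The per-GPU ingest rate is a purely hypothetical input, the function name is made up for illustration, and the model ignores network, file system, and protocol overheads, so treat it as an illustration rather than sizing guidance.

```python
import math


def arrays_needed(num_gpus: int, gb_per_s_per_gpu: float,
                  array_gb_per_s: float = 44.0) -> int:
    """Minimum number of arrays whose combined throughput covers GPU demand.

    array_gb_per_s defaults to the 44GBps EF600 figure quoted in this post;
    gb_per_s_per_gpu is an assumed, workload-dependent value.
    """
    demand_gb_per_s = num_gpus * gb_per_s_per_gpu
    return math.ceil(demand_gb_per_s / array_gb_per_s)


# Hypothetical workload: 512 GPUs (32 nodes x 16 GPUs), each ingesting 1 GB/s.
print(arrays_needed(512, 1.0))
```

Under that assumed 1 GB/s per GPU, the 512-GPU entry configuration would need an aggregate 512 GB/s, or 12 arrays at 44GBps each. The real answer depends heavily on the workload's actual I/O profile.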

To find out more about NVIDIA Magnum IO, be sure to check out the following links:

NetApp and NVIDIA: Taking High-Performance Computing to the Next Level
NVIDIA Developer Blog: GPUDirect Storage: A Direct Path Between Storage and GPU Memory

AI Anywhere with ScaleMatrix

Successful AI requires integration of a wide range of hardware and software elements. In addition to our NVIDIA partnership, NetApp is joining forces with a rapidly growing ecosystem of the most innovative AI vendors, offering a wide range of solutions and services to streamline on-premises, cloud, and colo deployments.

Delivering the power density and cooling necessary for the latest high-performance GPUs is one of the challenges that come with scaling AI projects. NetApp first introduced solutions and hosting services from our colocation partner ScaleMatrix at GTC in San Jose this year, combining ScaleMatrix Dynamic Density Control (DDC) cabinet technology with the power of ONTAP AI.

Now we are taking the partnership a step further by leveraging edge capabilities and the mobility and modularity of the DDC cabinet technology to deliver ONTAP AI as a plug-and-play solution, orderable with a single SKU, that can be deployed anywhere. It provides a self-contained environment with guaranteed air flow, integrated security, and fire and noise suppression. ScaleMatrix solutions are ideal for AI and other high-performance workloads, including edge inferencing in retail, healthcare, and manufacturing. Suitable for office locations, the solution can be up and running within minutes: you just plug it in and power it on. Systems can be deployed and redeployed to another location with ease. No data center needed.

ScaleMatrix is showcasing its mobile (R-Series) and modular (S-Series) cabinets with ONTAP AI configurations in its booth at SC19. You’ll find ScaleMatrix at booth #2131. Read the ScaleMatrix SC19 blog here and the press release here.

To find out more about ScaleMatrix and ONTAP AI, visit http://www.scalematrix.com/netapp.

More Information and Resources

NetApp AI and NetApp data fabric technologies and services can jumpstart your company on the path to success. Visit us at booth #249 at SC19, or check out these resources to learn more about ONTAP AI:

Solution Brief: NetApp ONTAP AI
White Paper: Edge to Core to Cloud Architecture for AI
White Paper: Building a Data Pipeline for Deep Learning
NetApp Validated Architecture: ONTAP AI – NVIDIA DGX-2 POD with NetApp AFF A800
NetApp Validated Architecture: NetApp ONTAP AI with Mellanox Spectrum Switches
Technical Report: AI at Scale with Trident, Kubernetes, and Kubeflow
Customer Story: Cambridge Consultants Uses AI for World-Changing Innovation

Santosh Rao

Santosh Rao is a Senior Technical Director and leads the AI & Data Engineering Full Stack Platform at NetApp. In this role, he is responsible for the technology architecture, execution and overall NetApp AI business.

Santosh previously led the Data ONTAP technology innovation agenda for workloads and solutions ranging from NoSQL, big data, virtualization, enterprise apps and other 2nd and 3rd platform workloads. He has held a number of roles within NetApp and led the original ground up development of clustered ONTAP SAN for NetApp as well as a number of follow-on ONTAP SAN products for data migration, mobility, protection, virtualization, SLO management, app integration and all-flash SAN.

Prior to joining NetApp, Santosh was a Master Technologist for HP and led the development of a number of storage and operating system technologies for HP, including development of their early generation products for a variety of storage and OS technologies.
