AI Model Training & AI Inferencing

Sathish Thyagarajan

May 4, 2023

Are you part of an organization that builds and trains its own AI models? Or do you use pretrained AI models hosted as API endpoints to process text, images, or other data? This is an exciting time to be part of the rapidly evolving age of AI. Since I wrote my last blog post just a few months ago, there has been an explosion in the use of AI models that can not only read, write, chat, and code but have also gained newer capabilities such as text-to-image generation. Some firms are learning how to build and train large AI models to grow revenue, while others are finding innovative ways to monetize pretrained AI models. Either way, AI is rapidly reaching critical mass as adoption grows at an unprecedented rate. Like a cherry on the AI sundae, generative AI and large language models (LLMs) have accelerated the need for speed among businesses even more.

NetApp and Lenovo for AI model training

In our relentless effort to support our customers who are training AI models, NetApp delivers a midrange cluster architecture that combines NetApp® storage with Lenovo ThinkSystem servers accelerated by NVIDIA GPUs and optimized for artificial intelligence and machine learning workloads. This solution is designed to handle AI model training on large datasets. It uses the processing power of GPUs alongside traditional CPUs in a flexible scale-out architecture: NVIDIA-certified Lenovo ThinkSystem servers, each equipped with eight NVIDIA A100 GPUs, alongside a single NetApp AFF A400 all-flash storage system. The NetApp storage system delivers AI training performance comparable to local SSD storage while enabling easy sharing of data between servers. In this NetApp and Lenovo joint engineering solution, we performed image recognition training as specified by the MLPerf benchmark to reach the desired accuracy. The result gives customers a turnkey reference architecture that reduces infrastructure overhead, improves performance, and streamlines data management for running AI model training across enterprises.

NetApp continues to provide enterprises with highly efficient and cost-effective performance when executing multiple AI training jobs in parallel. NetApp ONTAP® offers built-in data protection capabilities to meet low recovery point objectives (RPOs) and recovery time objectives (RTOs) with no data loss. It provides optimized data management with tamperproof NetApp Snapshot™ copies and clones that streamline the data pipelines for A/B testing in AI/ML experiments and compute-intensive AI/ML model training. ONTAP also supports NFSv4 over RDMA and NVIDIA GPUDirect Storage (GDS), which enable high throughput and low latency. Together, these capabilities remove storage bottlenecks and help optimize resources for AI and ML workloads, for better ROI on GPU compute infrastructure.

AI inferencing at the edge

Many companies are generating increasingly massive volumes of data at the network edge. To achieve maximum value from smart sensors and IoT data, organizations are looking for real-time event streaming solutions that enable edge computing. AI inferencing is one of the drivers of this trend. Several emerging applications, such as advanced driver-assistance systems (ADAS), Industry 4.0, and smart cities, require the processing of continuous data streams with near-zero latency. Edge servers with GPUs provide sufficient computational power for these workloads, but limited storage is often an issue, especially in multiserver environments. To address this gap, the NetApp AI solutions team worked with Lenovo to integrate NVIDIA-certified ThinkEdge servers with NetApp entry-level storage for edge AI inferencing.

The test procedure for this validation includes inference engines optimized for the GPUs in the compute servers, run against a range of datasets: ImageNet, COCO (object detection and segmentation), Criteo CTR (clickthrough rate), and BraTS (brain tumor segmentation). These datasets map to a variety of industrial use cases. Companies can use NetApp storage with GPU-enabled Lenovo ThinkEdge servers to virtualize transformative AI systems and support customers with edge AI deployments.
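Because edge inferencing workloads like these are usually judged on latency, it can help to see how per-request latency might be summarized. The following is a minimal sketch in Python using only the standard library. It is not the procedure used in the NetApp and Lenovo validation: the `measure_latency` helper and the `dummy_infer` stand-in are hypothetical names introduced here for illustration, and a real test would call an actual GPU inference engine on samples from the datasets above.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, Iterable


def measure_latency(infer: Callable[[Any], Any], samples: Iterable[Any]) -> Dict[str, float]:
    """Time each call to `infer` and summarize the latencies in milliseconds.

    `infer` is a hypothetical stand-in for whatever inference call is being tested.
    """
    latencies_ms = []
    for sample in samples:
        start = time.perf_counter()
        infer(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    if not latencies_ms:
        raise ValueError("at least one sample is required")

    latencies_ms.sort()
    # Nearest-rank 99th percentile: the smallest value with at least 99% of
    # the measurements at or below it.
    p99_index = max(0, math.ceil(0.99 * len(latencies_ms)) - 1)
    return {
        "count": len(latencies_ms),
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
        "max_ms": latencies_ms[-1],
    }


def dummy_infer(features):
    """Placeholder workload (assumption): a real test would run a model here."""
    return sum(value * value for value in features)


if __name__ == "__main__":
    batch = [[1.0] * 1000 for _ in range(500)]
    print(measure_latency(dummy_infer, batch))
```

Reporting a tail percentile such as p99 alongside the mean matters for streaming workloads, because a small share of slow requests can break a near-real-time guarantee even when the average looks healthy.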

NVIDIA, Lenovo, and NetApp

NetApp is also closely aligned with our partner NVIDIA. We have collaborated for more than three years to develop and release joint solutions that help customers enable enterprise-class AI. NetApp supports NVIDIA AI Enterprise environments. NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and analytics software, built on NVIDIA's extensive library of frameworks and pretrained models and supported by NVIDIA to run on NVIDIA-certified systems. It is certified for deployment on broadly adopted enterprise platforms, including VMware and multicloud instances on AWS, Microsoft Azure, and Google Cloud.

NetApp customers can use deep learning defect detection and computer vision AI models to monitor and analyze video footage right at the time and place it's created. This ability allows large enterprises to investigate manufacturing defects and meet production quality standards before the product leaves the assembly line. A wide range of industries take advantage of these models, from biopharmaceutical, automotive, medical, robotics, and security to businesses revolutionizing their operations with real-time cybersecurity in fintech or monitoring foot traffic in retail stores with edge AI.

NetApp is committed to driving engagements in retail, manufacturing, healthcare, and more with complete end-to-end solutions. As the data authority on hybrid cloud, NetApp collaborates with a network of AI partners on all aspects of constructing a data pipeline for AI training and AI inferencing.

Learn more

To learn more about NetApp’s joint solution with NVIDIA and Lenovo, read the technical papers AI Model Training and AI Inferencing at the Edge, or visit https://www.netapp.com/artificial-intelligence/.

Sathish Thyagarajan

Sathish joined NetApp in 2019. In his role, he develops solutions focused on AI at edge and cloud computing. He architects and validates AI/ML/DL data technologies, ISVs, experiment management solutions, and business use-cases, bringing NetApp value to customers globally across industries by building the right platform with data-driven business strategies. Before joining NetApp, Sathish worked at OmniSci, Microsoft, PerkinElmer, and Sun Microsystems. Sathish has an extensive career background in pre-sales engineering, product management, technical marketing, and business development. As a technical architect, his expertise is in helping enterprise customers solve complex business problems using AI, analytics, and cloud computing by working closely with product and business leaders in strategic sales opportunities. Sathish holds an MBA from Brown University and a graduate degree in Computer Science from the University of Massachusetts. When he is not working, you can find him hiking new trails at the state park or enjoying time with friends & family.
