
Powering enterprise RAG systems and AI inferencing


Sathish Thyagarajan

As we reached the crucial general availability (GA) milestone this month, the final stage of the NetApp® AIPod™ Mini with Intel launch, I reflected on two customer-centric questions that the solution addresses. The first concerns AI models with fewer parameters and recent developments in the world of tokenization; the second concerns AI infrastructure costs and the economics of deploying large language models (LLMs).

Is small the next big thing in AI?

When it comes to AI inferencing and LLM deployments, not all of your AI workloads or business use cases require the powerful capabilities of LLMs with 20+ billion parameters. Small language models (SLMs) have emerged as a powerful and practical alternative, especially for tasks that require specific domain expertise and for resource-constrained environments. When SLMs are fine-tuned with tailored datasets, evidence-based research suggests that they tend to outperform larger, general-purpose models in specific tasks and domains, such as medical diagnostics, algorithmic trading, and legal analysis.
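
To illustrate what that fine-tuning typically looks like in practice, here is a minimal sketch of parameter-efficient (LoRA) fine-tuning of a small model on a domain corpus, using the Hugging Face transformers and peft libraries. The model name, dataset path, and hyperparameters are illustrative placeholders, not part of the AIPod Mini solution.

```python
# Minimal LoRA fine-tuning sketch for a small language model on a domain corpus.
# Model, dataset path, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-v0.1"          # example SLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Attach low-rank adapters so only a small fraction of the weights are trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Domain corpus (placeholder path), tokenized into fixed-length sequences.
data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-domain-ft", num_train_epochs=1,
                           per_device_train_batch_size=2, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```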

One of my findings is a recent development called VeriGen, an AI model fine-tuned from the open-source CodeGen-16B language model, which contains 16 billion parameters, to generate Verilog code. Verilog is a hardware description language (HDL) that’s used for design automation in the semiconductor and electronics industry. Furthermore, as the CEO of the AI startup Hugging Face once suggested, up to 99% of use cases could be addressed by using SLMs. With SLMs demonstrating capabilities that back up the growing notion that “small is the next big thing in AI,” your AI engineers have new architectural considerations for the retrieval-augmented generation (RAG) deployment options available, including those that combine SLMs with LLMs in a hybrid AI solution.
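
One way to picture such a hybrid deployment is sketched below: a simple router keeps short, domain-scoped prompts on a locally served, fine-tuned SLM and escalates everything else to a larger general-purpose LLM. The routing rules and the slm_generate/llm_generate helpers are hypothetical placeholders; production routers are usually classifier- or confidence-based.

```python
# Hypothetical sketch of a hybrid SLM + LLM routing pattern.
from dataclasses import dataclass

@dataclass
class RouteDecision:
    target: str   # "slm" or "llm"
    reason: str

DOMAIN_KEYWORDS = {"verilog", "rtl", "testbench", "synthesis"}  # example domain

def route_to(prompt: str, max_words: int = 1024) -> RouteDecision:
    """Very simple router: keep short, domain-specific prompts on the small model."""
    words = prompt.lower().split()
    if len(words) > max_words:
        return RouteDecision("llm", "prompt too long for the small model")
    if DOMAIN_KEYWORDS & set(words):
        return RouteDecision("slm", "matches the fine-tuned domain")
    return RouteDecision("llm", "general-purpose request")

def slm_generate(prompt: str) -> str:
    # Placeholder: call your locally served, fine-tuned SLM here.
    return f"[SLM answer to: {prompt}]"

def llm_generate(prompt: str) -> str:
    # Placeholder: call your larger general-purpose LLM endpoint here.
    return f"[LLM answer to: {prompt}]"

def answer(prompt: str) -> str:
    decision = route_to(prompt)
    return slm_generate(prompt) if decision.target == "slm" else llm_generate(prompt)

print(answer("Write a Verilog testbench for a 4-bit counter."))
```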

Do all AI workloads need GPUs?

The choice between CPUs and GPUs also depends on the specific requirements of your AI workload. Although some AI applications benefit more from the parallel processing capabilities of GPUs, your organization might prioritize low latency, which CPUs can provide. Moreover, procuring GPUs can be a challenge, leading to hardware availability issues and supply-chain constraints in addition to your organization’s budget considerations. These constraints can slow your organization’s progress through the AI product development and deployment lifecycle. A thorough understanding of your organization’s workload characteristics and performance requirements leads to sound AI design decisions and prevents overengineering a solution with unnecessary complexity.

Announcing the general availability of NetApp AIPod Mini

The collaboration between NetApp and Intel has focused on these perspectives and customer pain points, with a product strategy centered on feasibility and viability and delivered through NetApp AIPod Mini, which is now generally available. This RAG system was designed for air-gapped AI inferencing workloads without the need for GPUs.

We’re announcing the GA of NetApp AIPod Mini at a time when a growing number of organizations are using RAG applications and LLMs to interpret user prompts. These prompts and responses can include text, code, images, and even information like therapeutic protein structures that are retrieved from an organization’s internal knowledge base. RAG accelerates knowledge retrieval and efficient literature review by quickly providing your researchers and business leaders with relevant and reliable information. 
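
Conceptually, a RAG pipeline retrieves the most relevant passages from that internal knowledge base and passes them to the model alongside the user’s question. The following is a minimal sketch, assuming a local sentence-transformers embedder and an OpenAI-compatible inference endpoint; the model name, endpoint URL, and documents are illustrative placeholders, not the OPEA implementation shipped with AIPod Mini.

```python
# Minimal RAG sketch: embed documents, retrieve by similarity, generate an answer.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local endpoint

# Internal knowledge base (placeholder documents).
docs = [
    "Quarterly safety protocol for lab equipment ...",
    "Purification steps for therapeutic protein candidate X ...",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                       # cosine similarity on normalized vectors
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def rag_answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="local-slm",                      # whatever model the endpoint serves
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(rag_answer("What are the purification steps for candidate X?"))
```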


AIPod Mini combines NetApp intelligent data infrastructure, composed of NetApp AFF A-Series systems powered by NetApp ONTAP® data management software, with Intel-based compute servers. These servers include Intel® Xeon® 6 processors with Intel® Advanced Matrix Extensions (Intel® AMX), Intel® AI for Enterprise RAG, and the Open Platform for Enterprise AI (OPEA) software stack.
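
As an illustration of what AMX-accelerated, CPU-only inference can look like, here is a hedged sketch using Intel® Extension for PyTorch, which dispatches BF16 matrix math to Intel AMX on supported Xeon processors. The model name is an illustrative example; this is not the packaged OPEA serving stack.

```python
# CPU-only BF16 inference sketch with Intel Extension for PyTorch (IPEX).
# On Xeon processors with AMX, BF16 matrix operations run on the AMX units.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"          # example model under 20B parameters
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator and graph optimizations for BF16 CPU inference.
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "Summarize our internal design review notes on the new controller."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```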

NetApp AIPod Mini supports pretrained models of up to 20 billion parameters (for example, LLaMA-13B, DeepSeek-R1-8B, Qwen-14B, and Mistral 7B). Intel® AMX accelerates AI inferencing across a combination of data types (for example, INT4, INT8, and BF16). NetApp AIPod Mini was jointly tested by NetApp and Intel using optimization techniques like activation-aware weight quantization (AWQ) for accuracy and speculative decoding for inference speed. It delivers up to 2,000 I/O tokens for 30+ concurrent users at 500+ tokens per second (TPS), balancing the trade-off between speed and accuracy for a superior user experience. You can find the benchmark results that were released on MLPerf Inference 5.0.
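
Speculative decoding pairs the target model with a smaller draft model that proposes tokens for the target model to verify, which speeds up generation while preserving the target model’s outputs. Below is a minimal sketch using the assisted-generation support in Hugging Face transformers; the model names are illustrative, and the exact tuning used in AIPod Mini is described in the technical report.

```python
# Speculative (assisted) decoding sketch: a small draft model proposes tokens
# that the larger target model verifies. Model names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-13b-chat-hf"    # example ~13B target model
draft_id = "meta-llama/Llama-2-7b-chat-hf"      # smaller draft model, same tokenizer family

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16)
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Explain retrieval-augmented generation in two sentences.",
                   return_tensors="pt")
# Passing assistant_model enables assisted generation (speculative decoding).
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```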

Following are just a few advantages of running a RAG system with NetApp AIPod Mini: 

  • NetApp ONTAP data management provides enterprise-grade storage to support various types of AI workloads, including batch and real-time inferencing, and offers the speed and scalability to handle your large datasets and dataset versioning. You also get multiprotocol data access: your client AI applications can read data by using the S3, NFS, and SMB protocols, which facilitates data access in multimodal LLM inference scenarios (see the data-access sketch after this list). Built-in NetApp Autonomous Ransomware Protection (ARP) delivers data protection, and both software- and hardware-based encryption enhance confidentiality and security for your RAG applications that retrieve knowledge from your company’s document repositories.
  • In addition to a data pipeline that’s powered by NetApp intelligent data infrastructure, you get OPEA for Intel® AI for Enterprise RAG. OPEA simplifies the transformation of your enterprise data into actionable insights, and Intel AI for Enterprise RAG offers key features that enhance scalability, security, and the user experience. OPEA includes a comprehensive framework featuring LLMs, datastores, prompt engines, and RAG architectural blueprints.
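
As referenced in the first bullet, here is a minimal ingestion sketch assuming the same ONTAP volume is both exported over NFS (and mounted on the client) and exposed through the S3 protocol. The mount path, endpoint URL, and bucket name are placeholders for your environment.

```python
# Multiprotocol ingestion sketch: read the same document repository over NFS or S3.
# Mount path, endpoint, and bucket are placeholders for your environment.
from pathlib import Path
import boto3

NFS_MOUNT = Path("/mnt/ontap/knowledge_base")      # NFS export mounted on the client
S3_ENDPOINT = "https://ontap-s3.example.internal"  # ONTAP S3 endpoint (placeholder)
BUCKET = "knowledge-base"

def read_docs_from_nfs() -> list[str]:
    """Read text documents directly from the NFS-mounted volume."""
    return [p.read_text(errors="ignore") for p in NFS_MOUNT.glob("**/*.txt")]

def read_docs_from_s3() -> list[str]:
    """Read the same repository through the S3 protocol."""
    s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)
    keys = [o["Key"] for o in s3.list_objects_v2(Bucket=BUCKET).get("Contents", [])]
    return [s3.get_object(Bucket=BUCKET, Key=k)["Body"].read().decode("utf-8", "ignore")
            for k in keys if k.endswith(".txt")]

documents = read_docs_from_nfs() or read_docs_from_s3()
# `documents` would then be chunked, embedded, and indexed by the RAG pipeline.
```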

Start boosting productivity with less compute power

RAG systems and LLMs are technologies that work together to provide accurate and context-aware responses retrieved from your organization’s internal knowledge repository. NetApp is a leader in data management, data mobility, data governance, and data security technologies across the ecosystem of edge, data center, and cloud. With NetApp AIPod Mini, you can take advantage of those technologies while relying on industry-leading NetApp and Intel solutions. NetApp AIPod Mini with Intel delivers an air-gapped RAG inferencing pipeline to help your enterprise deploy generative AI technologies with significantly less compute power while boosting your business productivity.

To learn more about the solution design and validation, review the technical report NetApp AIPod Mini: Enterprise RAG inferencing with NetApp and Intel. You can also learn more about how NetApp AI solutions can enhance performance, productivity, and protection for your AI workloads.

Sathish Thyagarajan

Sathish joined NetApp in 2019. In his role, he develops solutions focused on AI at the edge and in cloud computing. He architects and validates AI/ML/DL data technologies, ISVs, experiment management solutions, and business use cases, bringing NetApp value to customers globally across industries by building the right platform with data-driven business strategies. Before joining NetApp, Sathish worked at OmniSci, Microsoft, PerkinElmer, and Sun Microsystems. Sathish has an extensive career background in pre-sales engineering, product management, technical marketing, and business development. As a technical architect, his expertise is in helping enterprise customers solve complex business problems by using AI, analytics, and cloud computing, working closely with product and business leaders in strategic sales opportunities. Sathish holds an MBA from Brown University and a graduate degree in Computer Science from the University of Massachusetts. When he is not working, you can find him hiking new trails at the state park or enjoying time with friends and family.

