As we reached the crucial general availability (GA) milestone this month, the final stage of our NetApp® AIPod™ Mini with Intel launch, I reflected on two customer-centric questions that the solution addresses. The first concerns AI models with fewer parameters and recent developments in tokenization; the second concerns AI infrastructure costs and the economics of deploying large language models (LLMs).
When it comes to AI inferencing and LLM deployments, not all of your AI workloads or business use cases require the power of LLMs with 20+ billion parameters. Small language models (SLMs) have emerged as a powerful and practical alternative, especially for tasks that require specific domain expertise and for resource-constrained environments. When SLMs are fine-tuned with tailored datasets, evidence-based research suggests that they tend to outperform larger, general-purpose models in specific tasks and domains, such as medical diagnostics, algorithmic trading, and legal analysis.
One example is a recent development called VeriGen, an AI model fine-tuned from the open-source CodeGen-16B language model, which contains 16 billion parameters, to generate Verilog code. Verilog is a hardware description language (HDL) used for design automation in the semiconductor and electronics industry. Furthermore, as the CEO of the AI startup Hugging Face once suggested, up to 99% of use cases could be addressed by using SLMs. With SLMs demonstrating capabilities that back up the growing notion that “small is the next big thing in AI,” your AI engineers have new architectural considerations for retrieval-augmented generation (RAG) deployments, including options that combine SLMs with LLMs in a hybrid AI solution.
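As a purely illustrative example of one hybrid pattern, the sketch below routes a request to a domain-tuned SLM when the prompt matches that domain and falls back to a general-purpose LLM otherwise. The keyword rule and the model tiers are hypothetical and are not part of the AIPod Mini design.

```python
# Hypothetical router for a hybrid SLM + LLM deployment: prompts that clearly match
# the fine-tuned domain go to the SLM; everything else goes to the larger LLM.
DOMAIN_KEYWORDS = {"verilog", "hdl", "netlist", "testbench"}

def route_request(prompt: str) -> str:
    """Return the model tier that should answer this prompt."""
    if any(keyword in prompt.lower() for keyword in DOMAIN_KEYWORDS):
        return "slm"  # domain fine-tuned small language model
    return "llm"      # general-purpose large language model

print(route_request("Generate a Verilog testbench for the FIFO module"))  # -> slm
print(route_request("Summarize last quarter's sales trends"))             # -> llm
```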
The choice between CPUs and GPUs also depends on the specific requirements of your AI workload. Although some AI applications benefit more from the parallel processing capabilities of GPUs, your organization might prioritize low latency, which CPUs can provide. Moreover, procuring GPUs can be a challenge, leading to hardware availability issues and supply-chain constraints on top of your organization’s budget considerations. These constraints can slow your progress toward the finish line of your AI product development and deployment lifecycle. A thorough understanding of your organization’s workload characteristics and performance requirements leads to sound AI design decisions and prevents overengineering a solution with unnecessary complexity.
The collaboration between NetApp and Intel has focused on these perspectives and customer pain points, with a product strategy built around feasibility and viability, through NetApp AIPod Mini, which is now generally available. This RAG system is designed for air-gapped AI inferencing workloads without the need for GPUs.
We’re announcing the GA of NetApp AIPod Mini at a time when a growing number of organizations are using RAG applications and LLMs to interpret user prompts. These prompts and responses can include text, code, images, and even information like therapeutic protein structures that are retrieved from an organization’s internal knowledge base. RAG accelerates knowledge retrieval and efficient literature review by quickly providing your researchers and business leaders with relevant and reliable information.
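To make the retrieval step concrete, here is a minimal sketch of a RAG query flow, written against the open-source sentence-transformers library rather than the AIPod Mini or OPEA stack itself; the document snippets, the embedding model, and the prompt template are all illustrative assumptions.

```python
# Minimal RAG retrieval sketch: embed internal documents, find the chunk most
# similar to the user's question, and assemble an augmented prompt for the model.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Stand-in for an organization's internal knowledge base.
documents = [
    "Protocol A reduced purification time for therapeutic protein candidates by 20%.",
    "Quarterly report: storage utilization grew 35% year over year.",
    "Verilog lint rules for the new ASIC block are documented in design guide rev 3.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "What did we learn about purifying therapeutic proteins?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Retrieve the most relevant chunk by cosine similarity.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_chunk = documents[int(scores.argmax())]

# The retrieved context grounds the model's answer in internal knowledge.
augmented_prompt = (
    f"Answer using only the context below.\n\nContext: {best_chunk}\n\nQuestion: {query}"
)
print(augmented_prompt)  # this prompt would then be sent to the SLM/LLM for generation
```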
AIPod Mini combines NetApp intelligent data infrastructure, composed of NetApp AFF A-Series systems powered by NetApp ONTAP® data management software, with Intel-based compute servers. These servers include Intel® Xeon® 6 processors with Intel® Advanced Matrix Extensions (Intel® AMX), Intel AI for Enterprise RAG, and the Open Platform for Enterprise AI (OPEA) software stack.
NetApp AIPod Mini supports pretrained models of up to 20 billion parameters (for example, LLaMA-13B, DeepSeek-R1-8B, Qwen-14B, and Mistral 7B). Intel® AMX accelerates AI inferencing across a range of data types (for example, INT4, INT8, and BF16). NetApp and Intel jointly tested NetApp AIPod Mini using optimization techniques such as activation-aware weight quantization (AWQ) for accuracy and speculative decoding for inference speed. It delivers up to 2,000 I/O tokens for 30+ concurrent users at 500+ tokens per second (TPS), balancing the trade-off between speed and accuracy for a superior user experience. You can find the benchmark results released on MLPerf Inference 5.0.
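As a rough illustration of the inference-speed technique mentioned above, the following sketch shows speculative (assisted) decoding with the Hugging Face Transformers library, in which a small draft model proposes tokens that a larger target model verifies. It is not the validated AIPod Mini pipeline: the model names are placeholders, AWQ quantization is omitted, and BF16 is used because Intel AMX can accelerate BF16 math on Xeon CPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choices: a mid-size target model and a small draft model
# from the same family, so that they share a tokenizer.
TARGET_MODEL = "Qwen/Qwen1.5-14B-Chat"
DRAFT_MODEL = "Qwen/Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(TARGET_MODEL)

# BF16 weights on a CPU; on Intel Xeon processors with AMX, BF16 matrix
# operations are hardware accelerated.
target = AutoModelForCausalLM.from_pretrained(TARGET_MODEL, torch_dtype=torch.bfloat16)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_MODEL, torch_dtype=torch.bfloat16)

prompt = "Summarize the key findings from the retrieved research notes."
inputs = tokenizer(prompt, return_tensors="pt")

# assistant_model enables assisted (speculative) decoding: the draft model drafts
# candidate tokens and the target model accepts or rejects them, which can raise
# tokens per second without changing the target model's output distribution.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```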
Following are just a few advantages of running a RAG system with NetApp AIPod Mini:
RAG systems and LLMs are technologies that work together to provide accurate, context-aware responses grounded in your organization’s internal knowledge repository. NetApp is a leader in data management, data mobility, data governance, and data security technologies across the ecosystem of edge, data center, and cloud. With NetApp AIPod Mini, you can take advantage of those technologies while relying on industry-leading NetApp and Intel solutions. NetApp AIPod Mini with Intel delivers an air-gapped RAG inferencing pipeline to help your enterprise deploy generative AI technologies with significantly less compute power while boosting your business productivity.
To learn more about the solution design and validation, review the technical report NetApp AIPod Mini: Enterprise RAG inferencing with NetApp and Intel. You can also explore how NetApp AI solutions can enhance performance, productivity, and protection for your AI workloads.
Sathish joined NetApp in 2019. In his role, he develops solutions focused on AI at the edge and in the cloud. He architects and validates AI/ML/DL data technologies, ISVs, experiment management solutions, and business use cases, bringing NetApp value to customers globally across industries by building the right platform with data-driven business strategies. Before joining NetApp, Sathish worked at OmniSci, Microsoft, PerkinElmer, and Sun Microsystems. He has an extensive background in pre-sales engineering, product management, technical marketing, and business development. As a technical architect, his expertise is in helping enterprise customers solve complex business problems with AI, analytics, and cloud computing by working closely with product and business leaders on strategic sales opportunities. Sathish holds an MBA from Brown University and a graduate degree in computer science from the University of Massachusetts. When he is not working, you can find him hiking new trails at the state park or enjoying time with friends and family.