
NetApp automation helps cut COVID-19 transmission in the data center


Jay Jayakrishnan

Today's world produces information at an astonishing rate. From COVID-19 statistics to TikTok videos, our appetite for information and entertainment is insatiable and the scrolls on our phones are infinite. This year alone, it is estimated that the planet will create and consume a staggering 97 zettabytes of data, an increase of 23% over 2021. Not all of this data is saved, but enough finds its way to disk or flash to put enormous pressure on storage vendors to keep up with demand.

At NetApp, rising to this challenge is in our DNA. We have always led the industry in performance and storage density, and we have also shown leadership with important innovations that simplify management of large-scale storage systems.

Betting on automation

Our investments in automation are a good example of our leadership in action. As storage scales massively, systems need to autonomously make decisions to optimize their delivery of service. After all, humans just don't scale very well.

Recently we introduced new technology for automating memory repairs. One of the reasons that NetApp® storage systems are both fast and reliable is that we make extensive use of caching and buffering in fast memory. Our latest flagship offering, the all-flash A900 system, has up to 1TB of DIMMs, which hold everything from the NetApp ONTAP® storage management core to data in motion.

Rebounding from failure

DIMMs can fail for a lot of reasons. Sometimes a module has a bent pin, or it becomes unseated in its socket. Simply shipping a system to its destination can cause problems: At 30,000 feet, a single cosmic ray can knock out a memory cell for good.

Sometimes, though, a memory cell simply fails during regular operation. Usually these failures are caught and corrected with error correction codes. But a small fraction cannot be fixed, resulting in an uncorrectable error correction code (UECC) fault.
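To make the difference between a corrected error and a UECC fault concrete, here is a minimal sketch in Python. It assumes a toy Hamming(7,4) code with one extra overall parity bit, which can correct any single flipped bit and detect (but not repair) two flipped bits. The function names `encode` and `decode` are made up for illustration, and production DIMMs use much wider codes, so treat this as an analogy rather than a description of the actual hardware.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit word (toy SECDED code, illustration only)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]   # Hamming positions 1..7
    p0 = 0
    for bit in word:
        p0 ^= bit              # overall parity over the 7 Hamming bits
    return [p0] + word         # index 0 holds p0, index i holds position i


def decode(word):
    """Return (data, status); status is 'clean', 'corrected', or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos    # XOR of set positions points at a single bad bit
    overall = 0
    for bit in w:
        overall ^= bit         # 0 for an even number of flips, 1 for an odd number
    if syndrome == 0 and overall == 0:
        status = "clean"
    elif overall == 1:
        w[syndrome] ^= 1       # syndrome 0 means the overall parity bit itself flipped
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, cannot be fixed.
        return None, "uncorrectable"
    return [w[3], w[5], w[6], w[7]], status


word = encode([1, 0, 1, 1])

single = list(word)
single[5] ^= 1                 # one flipped bit
print(decode(single))          # ([1, 0, 1, 1], 'corrected')

double = list(word)
double[5] ^= 1
double[6] ^= 1                 # two flipped bits
print(decode(double))          # (None, 'uncorrectable')
```

The second case is the toy equivalent of a UECC fault: the code can tell that something is wrong but cannot recover the original data.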

In the past, a UECC fault meant that a technician would have to replace the entire DIMM, a costly and time-consuming process. It's also wasteful because the problem may be a single bit in a 64GB memory stick.

Now NetApp has incorporated Intel Post Package Repair (PPR) technology into our systems. PPR enables the system to self-heal in response to failures in memory. Most DIMM manufacturers include a few extra memory rows in their chips. PPR lets the system quickly swap out any row with a memory error for one of these spares, so that the system can continue operating.
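The row swap can be pictured as a small remapping table. The sketch below is a simplified Python model, not how PPR is implemented in silicon or firmware: the `MemoryBank` class, its method names, and the spare-row numbers are all hypothetical, chosen only to illustrate the idea of redirecting a faulty row to a spare and signaling when no spares remain.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class NoSpareRowsError(Exception):
    """Raised when no spare rows remain, so the DIMM needs human attention."""


@dataclass
class MemoryBank:
    # Hypothetical model: real spare-row counts vary by DRAM vendor and part.
    spare_rows: List[int]
    remap: Dict[int, int] = field(default_factory=dict)

    def resolve(self, row: int) -> int:
        """Return the physical row that currently backs a logical row."""
        return self.remap.get(row, row)

    def repair(self, faulty_row: int) -> int:
        """Redirect a faulty row to the next spare, or raise if none are left."""
        if not self.spare_rows:
            raise NoSpareRowsError(f"no spare rows left for row {faulty_row}")
        spare = self.spare_rows.pop(0)
        self.remap[faulty_row] = spare
        return spare


bank = MemoryBank(spare_rows=[1000, 1001])
bank.repair(42)                # row 42 reported an uncorrectable error
print(bank.resolve(42))        # 1000: accesses to row 42 now land on the spare
print(bank.resolve(7))         # 7: healthy rows are untouched
```

In this model, running out of spares raises `NoSpareRowsError`, which stands in for the point at which a physical replacement becomes necessary.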

Keeping people in the loop

Good automation models know when to escalate decisions to a human, and ours is no exception. We notify operators when PPR is being applied so they can call a technician for a chip replacement if necessary. But our recent data shows that 85% of the time, the automatic row swap meets our customers' needs immediately.

Why does this matter? Over the last 10 months, this capability translated to 76 instances where a technician did not need to enter a customer's data center. And in the midst of a global pandemic, minimizing service visits isn't just good business; it's the right thing to do to control the spread of COVID-19.

Jay Jayakrishnan

Jay Jayakrishnan is director of Software Engineering for Platform Software & Firmware in the Core Storage Engineering group. He leads execution of the platform software for systems running the world's largest storage operating system (ONTAP), as well as the BIOS and baseboard management controller (BMC) firmware for all NetApp storage product lines. His global team is spread across multiple locations in the United States, India, Taiwan, and Ireland.
