Date
December 18, 2013
Author
Haryadi Gunawi
The detection of "limping" hardware, that is, hardware whose performance deviates from its specification, is important for maintaining the performance, reliability, and availability of clustered Data ONTAP systems. The reports and anecdotes provided in the proposal recall a quote attributed to Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Clustered Data ONTAP is a distributed system; it employs many distributed-systems techniques, such as remote procedure calls (SpinNP), failure detectors, and consensus algorithms. The effects of a "limping" component on one controller can propagate to other controllers, rendering the entire system unusable. Even worse, such "limping" hardware goes undetected today, which adds a burden to the NetApp support team. Knowing how to detect such hardware and correct for it dynamically will therefore be valuable to NetApp.