Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Date

October 06, 2018

Author

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; Mingzhe Hao, Huaicheng Li, and H. Birali Runesha, University of Chicago

ACM Transactions on Storage (TOS), Volume 14, Issue 3, October 2018, Article No. 23

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Resources

A copy of the paper can be found at: https://doi.org/10.1145/3242086.