June 06, 2014
Vipul Mathur, Cijo George & Jayanta Basak
Performance problems are particularly hard to detect and diagnose in most computer systems, since there is no clear failure apart from the system being slow.
In this paper, we present an empirical, data-driven methodology for detecting performance problems in data storage systems and for aiding quick diagnosis once a problem is detected.
The key feature of our solution is that it combines time-series analysis, domain knowledge and expert inputs to improve overall efficacy. Our solution learns from a system's own history to establish a baseline of normal behavior, so it is not necessary to determine static trigger levels for metrics in order to raise alerts. Static triggers are ineffective because each system and its workloads differ from all others. The method presented here (a) gives accurate indications of the time period when something goes wrong in a system, and (b) helps pinpoint the most affected parts of the system to aid in diagnosis. Validation on more than 400 actual field support cases shows a true positive rate of about 85% with a false positive rate below 10% in identifying time periods of performance impact before or during the time a case was open. Results in a controlled lab environment are even better.
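To make the contrast with static triggers concrete, the following is a minimal sketch of a history-based baseline. It is not the method from the paper, whose details are not given in this abstract. It assumes a single metric sampled at regular intervals and swaps in a simple trailing median with median absolute deviation (MAD) as the baseline. The function name `flag_anomalies` and the parameters `window`, `k` and `min_spread` are hypothetical and chosen only for illustration.

```python
from statistics import median


def flag_anomalies(series, window=24, k=4.0, min_spread=1e-9):
    """Return indices whose value departs from a baseline learned from history.

    Illustrative only: the baseline is the median of the previous `window`
    samples, and the tolerance is k times the scaled MAD of those samples.
    No fixed, system-independent threshold is used.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        base = median(history)
        mad = median([abs(x - base) for x in history])
        # 1.4826 makes the MAD comparable to a standard deviation for
        # normally distributed data; min_spread guards against a zero MAD.
        spread = max(1.4826 * mad, min_spread)
        if abs(series[i] - base) / spread > k:
            flagged.append(i)
    return flagged


# Two hypothetical systems: a value of 50 is a spike for A but normal for B.
system_a = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50, 10, 11]
system_b = [50, 51, 49, 50, 52, 50, 51, 49, 50, 51, 50, 50, 50, 51]

print(flag_anomalies(system_a, window=8, k=4.0, min_spread=0.5))
print(flag_anomalies(system_b, window=8, k=4.0, min_spread=0.5))
```

In this toy data, the same reading of 50 is flagged on the first system and ignored on the second, which is the behavior a single static trigger level cannot provide for both systems at once.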
In Proceedings of the 30th International Conference on Massive Storage Systems and Technology (MSST 2014)
The definitive version of the paper can be found at: http://storageconference.us/2014/Papers/18.Anode.pdf
Presentation slides can also be found at: http://storageconference.us/2014/Presentations/Mathur.pdf