June 05, 2017
Rukma Talwadker, NetApp, Inc.
The 10th ACM International Systems and Storage Conference (SYSTOR2017) May 22-24, 2017 Haifa, Israel.
Misconfigurations in the storage systems can lead to business losses due to system downtime with substantial people resources invested into troubleshooting. Hence, faster troubleshooting of software misconfigurations has been critically important for the customers as well as the vendors.
This paper introduces a framework and a tool called Dexter, which embraces the recent trend of viewing systems as data to derive the troubleshooting clues. Dexter provides quick insights into the problem root cause and possible resolution by solely using the storage system logs. This differentiates Dexter from other previously known approaches which complement log analysis with source code analysis, execution traces etc.. Furthermore, Dexter analyzes command history logs from the sick system after it has been healed and predicts the exact command(s) which resolved the problem. Dexter's approach is simple and can be applied to other software systems with diagnostic logs for immediate problem detection without any pre-trained models.
Evaluation on 600 real customer support cases shows 90% accuracy in root causing and over 65% accuracy in finding an exact resolution for the misconfiguration problem. Results show up to 60% noise reduction in system logs and at least 10x savings in case resolution times, bringing down the troubleshooting times from days to minutes at times. Dexter runs 24x7 in the NetApp's® support data center.
The paper also presents insights from study on thousands of real customer support cases over thousands of deployed systems over the period of 1.5 years. These investigations uncover facts that cause potential delays in customer case resolutions and influence Dexter's design.
The definitive version of the paper can be found at: http://dl.acm.org/citation.cfm?id=3078484.