November 18, 2016
When distributed systems fail in the field, identifying the root cause and pinpoint the faulty software component or machine can be extremely hard and time-consuming. This research aims to provide an end-to-end solution to automate the diagnosis of production failures on distributed software stack solely using the unstructured logs output. It builds this in three parts. First, it aims to design a new postmortem diagnosis tool to automatically reconstruct the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the principle ("Flow Reconstruction Principle”) that programmers log events such that one can reliably reconstruct the execution flow a posteriori. However, any postmortem debugging relying on log output hinges on the efficacy of such logging. This is the focus of the second part, which is, to measure the quality of software’s log output. Finally, they intend to use this measurement to ultimately automate software logging itself.