December 18, 2013
The complexity of today's cloud systems makes it increasingly difficult to deliver bulletproof software services. Case in point: the large number and impact of recent failures at cloud software service providers. This proposed research has three main goals. First, it will provide a comprehensive study of failure characteristics in cloud-based distributed software systems. Key questions addressed by this study include: How do large-scale software failures compare to single-component software failures? Why do some non-fatal component failures propagate into catastrophic service-level failures? Are there testing opportunities to prevent these catastrophic failures? Is there sufficient evidence (e.g., logs, core dumps) for postmortem diagnosis? Are logs in large-scale software systems too noisy? Second, based on the findings of the study, the research will design practical tools that automate the diagnosis of failures in cloud-based distributed software systems by leveraging logs and source code. Finally, a daunting challenge in diagnosing any complex system is the sheer volume of noisy log messages. Hence, the third goal of this research is to ease problem diagnosis by developing techniques to sift through the noisy messages and find the relevant evidence.