April 22, 2015
Distributed cloud software infrastructures (i.e., cloud systems) have emerged as a dominant backbone for many modern applications. Cloud systems such as scale-out storage systems, computing frameworks, synchronization services, and cluster management services have become “the operating system” of cloud computing, and thus users expect high reliability from them. One of the major challenges these systems face is that they are deployed at very large scale (e.g., hundreds to thousands of nodes), yet developers cannot test them at such scale prior to deployment. As a result, we find that cloud systems are prone to scalability bugs, that is, bugs that appear only when the system is deployed at scale. Scalability bugs cannot be caught in small-scale testing, but in deployment they can have catastrophic consequences. Our proposal seeks to answer this fundamental question: How can we ensure, prior to deployment, that a distributed system will run correctly at scale? We believe that existing scale-out distributed cloud systems should be transformed into scale-checkable systems, that is, systems whose scalability properties can be tested prior to deployment without running them at scale. In other words, scale-checkable systems can reveal scalability bugs even when tested on only a few nodes. We will develop scale-checkable design principles that can later be applied to other scalable distributed systems.