December 18, 2013
The intelligence embedded in many modern Web sites and user-facing applications—e.g., recommendation of new friends or news articles of potential interest, selection and placement of advertisements, and matching callers to technical support personnel—is driven by data-intensive analytics. The input data for the analytics comes continuously from dozens of different sources on user-facing systems such as key-value stores, databases, and logging services. The results computed from the analytics are loaded back in near real time for use by applications. Terabytes of data may pass through these data cycles each day.

The goal of our project is to develop a comprehensive system, called Cyclops, to support these data cycles easily, efficiently, and in ways that scale automatically as data sizes increase or as the cycle latencies desired by applications change. We are conducting a comprehensive study of real-life data cycles in order to develop workloads for deeper analysis and benchmarking. Our initial studies indicate that these data cycles comprise workflows whose operations exhibit different types of behavior, necessitating a careful mix of systems—key-value stores, data stream processing systems, and batch analytics systems like MapReduce—to run the workflows efficiently. Considerable manual effort goes into creating and tuning these multi-system workflows today. Cyclops aims to provide a declarative language for easy specification of multi-system workflows, to schedule and optimize them automatically, to provision compute and storage resources across the underlying systems in a coordinated, application-centric manner, and to exploit opportunities for graceful load shedding under load spikes.
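To make the multi-system idea concrete, the following is a minimal, purely illustrative sketch of how a declaratively specified workflow might be mapped onto heterogeneous execution engines based on each operation's behavior. The `Stage` class, `plan_workflow` function, engine labels, and latency thresholds are all hypothetical assumptions for exposition; they are not the actual Cyclops language or API.

```python
# Hypothetical sketch (not the Cyclops API): declare workflow stages with
# their behavioral requirements, then let a trivial planner assign each
# stage to the kind of engine suited to that behavior.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: int          # latency the application needs from this stage
    input_tb_per_day: float  # expected daily input volume

def plan_workflow(stages):
    """Assign each stage to an engine class by its latency requirement:
    sub-10ms lookups -> key-value store; continuous low-latency
    transforms -> stream processor; large periodic jobs -> batch system.
    (Thresholds are illustrative, not from the Cyclops design.)"""
    plan = {}
    for s in stages:
        if s.latency_ms <= 10:
            plan[s.name] = "key-value store"
        elif s.latency_ms <= 1000:
            plan[s.name] = "stream processor"
        else:
            plan[s.name] = "batch (MapReduce)"
    return plan

workflow = [
    Stage("serve-recommendations", latency_ms=5, input_tb_per_day=0.1),
    Stage("update-user-features", latency_ms=500, input_tb_per_day=1.0),
    Stage("retrain-model", latency_ms=3_600_000, input_tb_per_day=5.0),
]

print(plan_workflow(workflow))
```

A real planner would of course consider far more than a single latency threshold—data volumes, operator semantics, resource availability, and load-shedding opportunities—but the sketch shows the shape of the problem: one declarative specification, many candidate engines, and an optimizer deciding the placement.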