
Remzi Arpaci-Dusseau, University of Wisconsin-Madison - September 2022


Traditional storage systems were designed and built for “important” data, much of which was human generated. For example, a filer might store Word documents that contain important corporate information, Excel spreadsheets for business and accounting purposes, and PowerPoint slide decks containing important presentations.

However, an increasing amount of data in the modern world is not generated by humans but is rather machine generated. For example, consider edge computing scenarios, with vast sensor networks and other inputs producing large volumes of data continuously. Even a single web camera can readily generate hundreds of GB of data per day; in a smart building scenario, with thousands of cameras (and other sensors), one can easily generate TBs of data daily, all of which could prove extremely valuable in security, environmental, and other building-wide applications.

This era of vast amounts of data, much of it machine generated, enables us to ask fundamental questions about how we build modern, scalable storage systems. For example, are traditional interfaces, such as write() for data ingest and read() for retrieval, still sensible? Today, applications use these interfaces for all data types, whether the most crucial information within an organization or machine-generated data; what should interfaces to storage be, when most data is machine generated?

Furthermore, how should each byte in such a storage system be treated? Traditional approaches obliviously treat each byte in a uniform way. For example, when storing data into a high-end storage system, the system will likely keep many copies (or some coded format) of the data for reliability, and also perform various other techniques (e.g., checksums) to ensure that the data written is indeed the same data that is returned at a later time. For valuable data, such precaution is sensible; for machine-generated information, perhaps less so.

To address these questions, and to realize a new generation of storage systems for the modern, data-driven era, we propose to investigate Storage Systems for Machine-Generated Data (or, MGD Storage for short). An MGD Storage system generally takes in large amounts of raw, unprocessed data, but then outputs only much smaller, derived forms of said data. MGD Storage can do so because of built-in intelligence; it understands data content (perhaps in a limited manner), and thus can perform computation when storing the data. This computation is the key to MGD Storage efficiency; instead of storing large amounts of raw, unprocessed data (perhaps most of which is not useful), MGD Storage can perform data reduction, filtering, and other intelligent techniques on the fly. When applications later request data, they do not directly read previously stored information; rather, through new access interfaces, they request the results of said processing.
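The core idea can be sketched in a few lines. The following toy Python class is purely illustrative (the names MGDStore, ingest, and query are assumptions, not the proposal's actual interface): raw machine-generated data is reduced on the write path, and reads return only the derived form.

```python
# Illustrative sketch of MGD-style ingest; names here are hypothetical.

class MGDStore:
    """Toy store that reduces raw machine-generated data at write time."""

    def __init__(self, reducer):
        self.reducer = reducer  # computation applied on the write path
        self.derived = {}       # only derived forms are retained

    def ingest(self, key, raw_records):
        # Instead of persisting raw_records, keep only the reduced form.
        self.derived[key] = self.reducer(raw_records)

    def query(self, key):
        # Applications read the results of prior processing, not raw data.
        return self.derived[key]


# Example reducer: keep only a summary of a sensor window, not every sample.
def summarize(samples):
    return {"min": min(samples), "max": max(samples),
            "mean": sum(samples) / len(samples)}

store = MGDStore(summarize)
store.ingest("cam-17/temp", [21, 22, 35, 22])
print(store.query("cam-17/temp"))  # {'min': 21, 'max': 35, 'mean': 25.0}
```

Here the store retains three numbers per window rather than every raw sample, which is the capacity-saving effect the proposal describes.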

The potential benefits of MGD Storage are large. The most obvious benefit is cost reduction due to capacity savings; instead of storing large amounts of (potentially machine-generated) data, an MGD Storage System can keep derived data forms useful for the given application or system context, thus potentially cutting down storage capacity costs by many orders of magnitude. A second benefit is performance; by pushing computation into the write path, applications may be able to more quickly access derived data when needed, lowering latency and generally improving performance. 

We will investigate MGD Storage within the context of LambdaStore, a new object-based storage system we plan to build to serve as a research prototype for this work. LambdaStore borrows ideas from “serverless” computing to enable intelligence directly embedded within storage. Specifically, each data object is associated with one or more “lambdas”, i.e., little bits of code which can run at opportune moments in order to perform the needed tasks. Such a platform is general enough to allow the broad investigation of the key ideas behind MGD Storage.
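The object-plus-lambdas idea above can be sketched as follows. This is a minimal assumption-laden Python sketch, not LambdaStore's actual design: attach_lambda and put are hypothetical names, and here lambdas run synchronously at write time, though the proposal allows them to run at any opportune moment.

```python
# Hypothetical sketch of a LambdaStore-like object store; API names are
# illustrative assumptions, not the proposal's real interface.

class LambdaObjectStore:
    def __init__(self):
        self.objects = {}
        self.lambdas = {}  # object name -> list of attached lambdas

    def attach_lambda(self, name, fn):
        # Associate a small piece of code with a data object.
        self.lambdas.setdefault(name, []).append(fn)

    def put(self, name, data):
        # Run the attached lambdas at write time; store only their output.
        for fn in self.lambdas.get(name, []):
            data = fn(data)
        self.objects[name] = data

    def get(self, name):
        return self.objects[name]


store = LambdaObjectStore()
# A lambda that filters out uninteresting readings at ingest.
store.attach_lambda("sensor-log", lambda recs: [r for r in recs if r > 30])
store.put("sensor-log", [12, 45, 7, 88])
print(store.get("sensor-log"))  # [45, 88]
```

Because each object carries its own chain of lambdas, different objects can apply different reduction or filtering policies, which is what makes the platform general enough for the investigations described next.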

There is a broad set of research questions that we will address as part of this work. Part will be architectural: how should LambdaStore be designed to enable the full potential of MGD Storage? Part will be policy-based: how should load be managed? When should lambdas be scheduled to run? How should objects be placed to reduce communication?
