July 16, 2011
Recent years have seen the widespread use of large-scale content-based systems, such as, wide-area accelerators, caches, data de-duplication systems, content-based routers, and multimedia similarity detectors. These systems crucially depend on the use of indexes for rapidly locating content of interest. Since the underlying base of content that these systems compute over is large (typically several terabytes to petabytes), the indexes are also large, ranging from a few 10s to 100s of GB. Existing approaches for maintaining such indexes, such as using DRAM or disk-based techniques, are either too expensive or too slow. In this research, we consider a suite of index designs that leverage flash-based storage (e.g., SSDs) to offer high-performance at low cost, and wide-ranging flexibility to meet the demands of a variety of content-based systems. Our goal is to design schemes that are 1-2 orders of magnitude better than the competition in terms of index operations per second per dollar.We build on our prior work on designing fast flash-based indexes for wide-area acceleration. In prior work, we showed that workloads where reads and writes are both frequent can be supported by augmenting flash with a small amount of DRAM and adopting a combination of buffering and lazy updates to support high performance at low cost. We also showed how to support simple eviction under these workloads.The proposed research expands our prior work in significant new directions and to several new content-based systems. In our research plan, we first consider how to design the appropriate data structures to handle different mixtures of reads and writes at the index. We then consider how to handle updates, deletions and streaming-style evictions from the index in a flexible and efficient fashion under different workloads. We consider how to modify the designs to accommodate hierarchical keys, as opposed to flat hash-based keys. We also consider data structures to support approximate nearest-neighbor matches over the keys, as opposed to exact matches. In all cases, we consider a two-tier set-up where the index spans a small amount of DRAM and a much larger amount of flash-based storage.