April 15, 2011
The project aims to quantify the trade-offs between local disks and NAS for supporting Hadoop workloads, and to identify the bottlenecks and scenarios under which one approach is preferable to the other. We then plan to use these findings to investigate different storage-compute configurations. The overall goal is to study how NAS can be employed to complement Hadoop systems. We are also interested in identifying the application characteristics that make a workload a better fit for a NAS-enabled Hadoop environment than for the original Hadoop setup.

A research challenge is to capture how integrating different storage system designs into Hadoop affects overall performance, and how the storage system can be tailored to meet the I/O demands of applications. Exploring such design decisions on real setups would be ideal. Unfortunately, limited resources and very long time-to-solution make such evaluation impractical. An alternative is to evaluate a new system design in a simulator that accurately captures the behavior of Hadoop and its various parameters and configurations. We have built such a simulator, MRPerf, which can simulate the performance of Hadoop applications when provided with an infrastructure specification and workload characteristics.

We will design, develop, implement, and evaluate different storage designs for Hadoop setups through simulation, and study different scenarios of storage integration with Hadoop. We will add a storage device model (based on feedback from NetApp) to the MRPerf simulator. This will include extending MRPerf and developing a new interface for a configurable storage device, with parameters such as I/O bandwidth and latency. We aim to capture critical interactions such as network contention, processor contention, and storage contention to a reasonable extent. The results of such simulations can provide significant insights into the system and enable the design of suitable real storage devices for Hadoop.
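
To make the idea of a configurable storage device with I/O bandwidth and latency parameters concrete, a minimal sketch follows. It is not MRPerf's actual interface: the class name, field names, units, and the even-split contention rule are all assumptions chosen for illustration, and the numeric values are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageDevice:
    """Hypothetical storage device model; names and units are assumptions."""
    name: str
    bandwidth_mb_s: float  # sustained I/O bandwidth in MB/s
    latency_s: float       # fixed per-request latency in seconds

    def transfer_time(self, size_mb: float, concurrent_streams: int = 1) -> float:
        """Estimate the seconds needed to move size_mb of data.

        Bandwidth is split evenly across concurrent_streams, which is a
        deliberately crude stand-in for storage contention.
        """
        if size_mb < 0:
            raise ValueError("size_mb must be non-negative")
        if concurrent_streams < 1:
            raise ValueError("concurrent_streams must be at least 1")
        effective_bw = self.bandwidth_mb_s / concurrent_streams
        return self.latency_s + size_mb / effective_bw


# Placeholder parameters for illustration only, not measured values.
local_disk = StorageDevice("local-disk", bandwidth_mb_s=80.0, latency_s=0.008)
nas = StorageDevice("nas", bandwidth_mb_s=400.0, latency_s=0.002)

block_mb = 64.0  # assumed size of one block read by a map task
for streams in (1, 4, 16):
    # Each node reads from its own local disk, while all streams share the NAS.
    t_local = local_disk.transfer_time(block_mb)
    t_nas = nas.transfer_time(block_mb, concurrent_streams=streams)
    print(f"streams={streams:2d}  local={t_local:.3f}s  nas={t_nas:.3f}s")
```

Under these assumed parameters the shared device is faster with few concurrent readers and slower once enough readers contend for it, which is the kind of crossover the simulation study is meant to locate. A real model would also need to account for network and processor contention, which this sketch omits.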