February 08, 2013
Madalin Mihailescu, University of Toronto and NetApp; Gokul Soundararajan, NetApp; Cristiana Amza, University of Toronto
MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems.
Abstract: Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.
We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.
In Proceedings of the USENIX Conference on File and Storage Technologies (FAST'13), February 2013.
A copy of the paper is attached to this posting. The definitive version of the paper can be found at: https://www.usenix.org/conference/fast13/mixapart-decoupled-analytics-shared-storage-systems.