Date
September 11, 2014
Author
Srinivasan Narayanamurthy, Kartheek Muthyala and Gaurav Makkar.
The recent increase in interest in batch analytics has resulted in extensive use of distributed frameworks such as Hadoop and Dryad. Batch analytics, as the name suggests, performs many computations on large volumes of data.
That is, large quantities of data are ingested once and read many times, mostly in large chunks, a pattern characterized as a write-once read-many (WORM) workload. The storage layer of these distributed frameworks (say, HDFS in Hadoop) uses file systems such as ext4 or XFS as native object stores, storing objects as files on the individual nodes of the distributed system. These general-purpose file systems were designed with broader goals such as POSIX compliance, good performance across a wide range of file sizes, user friendliness, etc. However, most of these features are not required of a native object store in a distributed file system.
WORMStore is a lightweight object store designed exclusively for use in distributed systems with WORM workloads. WORMStore provides advantages such as the ability to prefetch large objects, a small metadata-to-data ratio, media-aware data/metadata placement, etc. As WORMStore is log-structured, it can recover upon failure. Our experiments show that WORMStore provides a 28% increase in the read throughput per node in a Hadoop cluster.
The authors' version of the paper is attached to this posting. Please observe the following copyright: © 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.