September 11, 2014
Atish Kathpal and Giridhar Yasa
Long-term retention of data has become the norm, driven by compliance requirements and the preservation of data for future needs. As storage media continue to become cheaper, this trend has strengthened further, as evidenced by the introduction of archival solutions such as Amazon Glacier and Spectra Logic BlackPearl.
On the other hand, analytics and big data have become key enablers for business and research. However, analytics and archiving happen in separate storage silos. This incurs additional cost and inefficiency when part of the archived data needs to be analyzed using batch analytics platforms like Hadoop, because (a) additional storage is needed for data transferred from the archive tier to the analytics tier, and (b) transfer time costs are incurred by migrating data to the analytics tier. Moreover, accessing archived data has a high time to first byte, as much of the data resides on offline media such as tapes or spun-down disks. We introduce Nakshatra, a data processing framework that runs analytics directly on an archive based on offline media. To the best of our knowledge, this is the first work of its kind in the literature. We leverage batched pre-fetching and scheduling techniques for improved data retrieval and scalable analytics on archives. Our preliminary evaluation shows Nakshatra to be up to 81% faster than the traditional ingest-then-compute workflow for archived data.
2014 IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS)
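The core idea behind batched pre-fetching can be sketched as follows. This is an illustrative assumption, not the paper's actual scheduler: the function names, the block-to-medium mapping, and the "largest batch first" policy are hypothetical, chosen only to show how grouping requests per offline medium amortizes the high time to first byte over many reads.

```python
from collections import defaultdict

# Hypothetical sketch (not Nakshatra's implementation): group pending
# block requests by the offline medium (e.g. a tape) that holds them,
# so each medium is loaded once and all its blocks are read in a batch.
def schedule_batches(requests, block_to_medium):
    """requests: iterable of block ids; block_to_medium: block id -> medium id.

    Returns a list of (medium, [blocks]) batches, one batch per medium.
    """
    batches = defaultdict(list)
    for block in requests:
        batches[block_to_medium[block]].append(block)
    # Serve the medium with the most pending blocks first, so each
    # slow media load (tape mount, disk spin-up) pays for many reads.
    return sorted(batches.items(), key=lambda kv: -len(kv[1]))
```

Under this sketch, three requests spread across two tapes collapse into two media loads instead of three, which is the effect the abstract attributes to batched pre-fetching.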