Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

Date

April 02, 2011

Authors

Ian Adams, Ethan L. Miller, and Mark W. Storer

This paper examines the workload behavior of several scientific and historical archives to provide relevant input for the design of effective long-term data storage systems.

The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increased cost efficiency of hard drives compared to tape, and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are forced either to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or to extrapolate from marginally related enterprise workload data and traditional library access patterns.

To provide relevant input for the design of effective long-term data storage systems, we examined the workload behavior of several scientific and historical archives, covering a mixture of purposes, media types, and access models. Our findings show that, for scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in public content archives, we observed behavior that diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified relatively frequently, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.

Published as Storage Systems Research Center (SSRC) Technical Report UCSC-SSRC-11-01

Resources

This technical report is available on the University of California, Santa Cruz, SSRC site: https://www.ssrc.ucsc.edu/pub/adams-ssrctr-11-01.html.