Bianca Schroeder, University of Toronto - July 2010


July 18, 2010


Bianca Schroeder

A Unified Framework for Managing Storage System Reliability

Because the loss of data can have devastating consequences for many businesses, reliability is one of the most important aspects of designing enterprise storage systems. Designing and configuring storage systems that minimize the chance of data loss requires a good understanding of how different design and configuration choices, as well as future technological trends, affect system reliability.

Unfortunately, the details of many key aspects of storage system reliability are not well understood. Examples include the effect of environmental parameters such as temperature, the effect of workload type and workload intensity, the degree of correlation between different failures or error events, and the reliability characteristics of new media such as solid state drives. As a result, much existing work on storage system reliability either ignores many of these factors or relies on simplistic assumptions that do not reflect the real world.

This research takes a three-pronged approach to overcoming many of the above problems: (1) we plan to collect and analyze field data from production storage systems, which will allow us to study those aspects of storage system reliability that are particularly poorly understood; (2) we will use the results of our data analysis to derive more realistic models and simulation environments for evaluating storage reliability; (3) we will use our models to answer some frequently asked questions about storage system reliability and to make projections about future challenges in building reliable storage systems.

We expect that the tools and insights derived from our work will be useful both to practitioners involved in configuring and running large-scale storage systems and to the designers of next-generation storage systems.