July 01, 2013
P.C. Nagesh and Atish Kathpal.
Experiments on real-world datasets show that Rangoli reclaims space far more efficiently than strategies based on finding unique files or on MinHash.
Space management is the activity of monitoring and ensuring adequate free space on all volumes in a clustered storage system. Volumes that exceed their used-space limits are typically relieved by migrating part of their data to other, underutilized volumes. Without deduplication, space reclamation is simple: one migrates exactly as much data as the space to be reclaimed. In deduped volumes, however, there is no direct relation between the logical size of a file and the physical space it occupies. Optimal space reclamation is therefore hard because: a) migrating a few files may free little or no space while still incurring significant network costs; b) migrating a heavily shared file destroys the disk sharing relationships in that volume and increases the physical space consumption of the dataset.
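The gap between logical size and reclaimed physical space can be illustrated with a small sketch. This is not Rangoli's algorithm, only a toy model of a deduped volume under stated assumptions: fixed-size blocks, each file represented as a list of block fingerprints, and a destination volume that shares none of the migrated blocks. The names `reclaimed_bytes`, `logical_bytes`, `transferred_bytes`, and the sample files are hypothetical.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


def logical_bytes(volume, migrate):
    """Logical size of the migrated files (what a non-deduped volume would free)."""
    return sum(len(volume[name]) for name in migrate) * BLOCK_SIZE


def reclaimed_bytes(volume, migrate):
    """Physical bytes freed on the source volume once `migrate` has left it."""
    # Number of files on the whole volume that reference each block.
    refs = Counter(fp for blocks in volume.values() for fp in set(blocks))
    # Number of migrated files that reference each block.
    leaving = Counter(fp for name in migrate for fp in set(volume[name]))
    # A block is freed only if every file referencing it is migrated.
    freed = [fp for fp, n in leaving.items() if refs[fp] == n]
    return len(freed) * BLOCK_SIZE


def transferred_bytes(volume, migrate):
    """Physical bytes written at the destination, assuming it shares no blocks."""
    distinct = {fp for name in migrate for fp in volume[name]}
    return len(distinct) * BLOCK_SIZE


# Toy volume: two files share three blocks, a third file shares nothing.
volume = {
    "a.vmdk": ["b1", "b2", "b3", "b4"],
    "b.vmdk": ["b1", "b2", "b3", "b5"],
    "c.log": ["b6", "b7"],
}

for choice in ({"a.vmdk"}, {"c.log"}, {"a.vmdk", "b.vmdk"}):
    print(sorted(choice),
          "logical:", logical_bytes(volume, choice),
          "reclaimed:", reclaimed_bytes(volume, choice),
          "transferred:", transferred_bytes(volume, choice))
```

In this toy volume, migrating `a.vmdk` alone moves four blocks over the network but frees only one on the source, because the other three are still referenced by `b.vmdk`. Migrating the smaller `c.log` frees two blocks at half the transfer cost.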
In this work, we have designed and built Rangoli, a fast and efficient tool that identifies the optimal set of files for space reclamation in a deduped environment. It scales to millions of files and terabytes of data and runs in tens of minutes. Experiments on real-world datasets show that alternative strategies, such as those based on finding unique files or using MinHash, impact physical space consumption by a wide margin (up to 35 times) compared to Rangoli.
In Proceedings of the 6th International Systems and Storage Conference (SYSTOR ’13)
The author’s version of the paper is attached to this posting. Please observe the following copyright: © ACM, 2013. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 6th International Systems and Storage Conference (SYSTOR ’13), https://dl.acm.org/citation.cfm?id=2485744&dl=ACM&coll=DL&CFID=232291986&CFTOKEN=21621091.
Presentation slides are also attached to this posting.