Date
July 15, 2011
Author
George Porter
The project proposes to investigate the effects of networked storage on data-intensive scalable computing (a.k.a. big data). Clusters for processing big data are built to be very efficient, but no attention has been paid to how data gets into or out of the cluster, i.e., to the sources and sinks of the data located elsewhere within the enterprise. The project pursues the idea that there are efficiencies to be gained by looking at the full ecosystem, in which data lives on networked storage in the enterprise and must (currently) be loaded onto the cluster for processing. This work will be done in the context of the UCSD TritonSort project, a framework that recently set records on certain GraySort benchmarks.