Date
August 08, 2014
Author
Patrick Eugster
Big data processing represents one of the major challenges of this era. Cloud computing technologies offer a partial response to the computation and storage resource needs of big data processing. However, in contrast to the illusion of omni-present uniform storage and computation resources promoted by many cloud providers, clouds are implemented by concrete datacenters with specific locations; big data is often geographically distributed (geo-distributed) or even partitioned across datacenters (or zones, sub-networks), and inter-datacenter access times differ strongly from intra-datacenter ones. Despite the substantial impact of communication costs on response times in big data applications (up to 50% even in single-datacenter setups), existing systems deployed in clouds work poorly across datacenters, e.g., by naïvely copying all data first "to the mothership", if they support such setups at all. Existing work on geo-distributed clouds considers data storage but not the problem of computing complex queries on such data.

We consider a setup where big datasets are distributed or even partitioned across datacenters, and we want to execute queries across these datasets, exploiting parallelization. Rather than naïvely copying all data to one datacenter before running a query, our goal is intuitively to move parts of the query toward the respective data, and to make smart decisions on where and when to consolidate data as the query requires. Given that inter-datacenter communication is more costly (time-wise and financially) than intra-datacenter communication, and that the amount of data in big data jobs commonly decreases as computation proceeds, consolidating data as late as possible is generally beneficial.
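To make the late-consolidation intuition concrete, here is a minimal sketch (with hypothetical partition sizes and reduction factor, none of which come from the project itself) comparing the inter-datacenter traffic of two plans for an aggregation query over data partitioned across three datacenters: copying all raw data to one site first versus aggregating locally and shipping only the reduced partial results.

```python
# Hypothetical raw partition sizes (GB) in three datacenters.
raw_partition_gb = {"dc_east": 400, "dc_west": 250, "dc_eu": 350}

# Assumption: local aggregation shrinks each partition ~100x, reflecting
# the observation that data volume commonly decreases as computation proceeds.
reduction_factor = 0.01

dest = "dc_east"  # datacenter chosen to consolidate results

# Plan A ("copy to the mothership"): every remote raw partition crosses
# an expensive inter-datacenter link before any computation runs.
plan_a_transfer = sum(gb for dc, gb in raw_partition_gb.items() if dc != dest)

# Plan B (late consolidation): each datacenter aggregates its own partition
# locally; only the much smaller partial results cross datacenter boundaries.
plan_b_transfer = sum(gb * reduction_factor
                      for dc, gb in raw_partition_gb.items() if dc != dest)

print(f"copy-to-mothership: {plan_a_transfer:.1f} GB inter-datacenter traffic")
print(f"late consolidation: {plan_b_transfer:.1f} GB inter-datacenter traffic")
```

Under these illustrative numbers, late consolidation moves 6 GB across datacenters instead of 600 GB; the actual savings depend on how sharply the query reduces its input, which is exactly the property a query placement strategy would exploit.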