Source Information
Abstract
English
MapReduce has become an important distributed processing model for large-scale data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used for short jobs that require low response time. In this paper, we propose a new preshuffling strategy in Hadoop to reduce the high network load imposed by shuffle-intensive applications. Designing new shuffling strategies is appealing for Hadoop clusters whose network interconnects become a performance bottleneck when the clusters are shared among a large number of applications; the interconnects are likely to become a scarce resource when many shuffle-intensive applications share a single Hadoop cluster. We implemented the push model along with the preshuffling scheme in the Hadoop system, incorporating a 2-stage pipeline into the preshuffling scheme. Using two Hadoop benchmarks running on a 10-node cluster, we conducted experiments showing that preshuffling-enabled Hadoop clusters are faster than native Hadoop clusters. For example, the push model and the preshuffling scheme powered by the 2-stage pipeline shorten the execution times of the WordCount and Sort Hadoop applications by an average of 10% and 14%, respectively.
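To make the push-model idea concrete, the following is a minimal, hypothetical Python sketch (not Hadoop code, and not the paper's implementation): each map task's output is hash-partitioned and pushed into per-reducer buffers as it is produced, instead of being pulled by reducers after the whole map phase completes. All names (`push_shuffle`, `partition`, buffer layout) are illustrative assumptions.

```python
from collections import defaultdict

NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash-partition a key to a reducer, analogous to Hadoop's default
    # hash partitioner. (Illustrative only; Python's hash differs.)
    return hash(key) % num_reducers

def map_word_count(record):
    # WordCount mapper: emit (word, 1) for each word in the record.
    for word in record.split():
        yield word, 1

def push_shuffle(records):
    # Push model: map output is pushed into per-reducer buffers as soon
    # as it is emitted, overlapping shuffle work with the map phase,
    # rather than having reducers pull after all map tasks finish.
    reducer_buffers = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for record in records:
        for key, value in map_word_count(record):
            reducer_buffers[partition(key)][key].append(value)
    return reducer_buffers

def reduce_sum(buffer):
    # WordCount reducer: sum the values collected for each key.
    return {key: sum(values) for key, values in buffer.items()}

records = ["the quick brown fox", "the lazy dog", "the fox"]
counts = {}
for buf in push_shuffle(records):
    counts.update(reduce_sum(buf))
```

In a real cluster the push targets are remote reduce nodes, so eagerly streaming partitions spreads the shuffle traffic over the map phase instead of concentrating it afterward, which is the network-load effect the preshuffling scheme targets.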
Table of Contents
1. Introduction
1.1. Shuffle-Intensive Hadoop Applications
1.2. Alleviate Network Load in the Shuffle Phase
1.3. Benefits and Challenges of the Preshuffling Scheme
1.4. Organization
2. Background
2.1. MapReduce Overview
2.2. Hadoop Distributed File System
3. Design Issues
3.1. Push Model of the Shuffle Phase
3.2. A Pipeline in Preshuffling
3.3. In-memory Buffer
4. Implementation
5. Performance Evaluation
5.1. Experimental Environment
5.2. In Cluster
5.3. Large Blocks vs. Small Blocks
6. Related Work
7. Conclusion
Acknowledgments
References