MapReduce is a programming model and an associated implementation for processing and generating large datasets on clusters of commodity machines. Users specify map and reduce functions, and the runtime system automatically parallelizes the computation across the cluster. The system handles machine failures, schedules inter-machine communication efficiently, and presents a simple interface to the programmer. Over the past four years, more than 10,000 distinct MapReduce programs have been implemented at Google, with over 100,000 jobs executed daily, processing more than 20 petabytes of data per day.
Before MapReduce, Google engineers implemented hundreds of special-purpose programs to process large volumes of raw data, such as crawled documents and web request logs. This raw data was used to compute various kinds of derived data, such as inverted indexes, graph representations of web documents, and sets of the most frequent queries. While these computations were conceptually straightforward, the sheer size of the input required the work to be distributed across hundreds or thousands of machines. The complexity of handling parallel computation, data distribution, and fault tolerance led to large amounts of intricate code that obscured the original, simple computation.
To address this complexity, the authors introduced a new abstraction that hides the details of parallelization, fault tolerance, data distribution, and load balancing. This abstraction is inspired by the map and reduce primitives in functional languages like Lisp. The authors realized that most of their computations involved applying a map operation to each logical record of input, generating a set of intermediate key/value pairs, and then applying a reduce operation to all values associated with the same key to derive the final output. This functional model allows for easy parallelization of large computations and enables fault tolerance through re-execution.
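The canonical illustration of this model is the paper's word-count example: the user supplies only the two functions below, and the system handles everything else. This is a minimal sketch in Python rather than the paper's C++-style pseudocode, and the function names and signatures are illustrative, not Google's actual API.

```python
# Illustrative word-count functions in the MapReduce style (not Google's actual API).

# map: for each word in a document, emit the intermediate pair (word, 1).
def map_fn(doc_name, doc_contents):
    for word in doc_contents.split():
        yield (word, 1)

# reduce: sum all counts emitted for the same word to get its total occurrences.
def reduce_fn(word, counts):
    yield (word, sum(counts))
```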
The main contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, together with an implementation of this interface on large clusters of commodity PCs. The programming model can also be used to parallelize computations across multiple cores of a single machine. Section 2 describes the programming model with examples. Section 3 describes the implementation of the MapReduce interface on a large cluster. Section 4 presents several extensions of the programming model. Section 5 reports performance measurements on a variety of tasks. Section 6 discusses the experience of using MapReduce at Google, including the rewrite of the production indexing system. Section 7 discusses related research and future challenges.
The MapReduce programming model is based on key/value pairs as input and output. Users express their computation as map and reduce functions: the map function processes input key/value pairs and emits intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same key to produce the final output. The model is flexible, allowing various data types and custom input/output formats. The system is implemented on a large cluster of commodity PCs connected by a gigabit Ethernet network, with a distributed file system (GFS) providing storage. It tolerates machine failures by re-executing tasks, relying on atomic commit of task output to make re-execution safe, and it conserves network bandwidth by scheduling tasks on machines that hold local copies of the input data.
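To make the data flow concrete, the toy sketch below simulates the model's three conceptual phases in a single process: apply the map function to every input record, group the intermediate pairs by key, then apply the reduce function to each key's value list. It is an assumption-laden illustration of the key/value contract only; the real implementation partitions intermediate keys across many machines, handles failures, and exploits locality, none of which appears here.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy, single-process simulation of the MapReduce data flow
    (map -> group by intermediate key -> reduce). Not Google's implementation."""
    # Map phase: apply the user's map function to every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: apply the user's reduce function to each key's value list.
    output = []
    for ikey in sorted(intermediate):
        output.extend(reduce_fn(ikey, intermediate[ikey]))
    return output

# Example usage with word-count functions like those sketched earlier.
def map_fn(doc_name, contents):
    return ((word, 1) for word in contents.split())

def reduce_fn(word, counts):
    return [(word, sum(counts))]

docs = [("doc1", "the quick brown fox"), ("doc2", "the quick dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('quick', 2), ('the', 2)]
```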