Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud

2012 | Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein
Distributed GraphLab is a framework for machine learning and data mining in the cloud. It extends the GraphLab abstraction to a distributed setting while preserving strong data consistency guarantees. The framework introduces graph-based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effects of network latency. Fault tolerance is achieved using the Chandy-Lamport snapshot algorithm, which is implemented within the GraphLab abstraction. The framework is evaluated on a large Amazon EC2 deployment, showing 1-2 orders of magnitude performance gains over Hadoop-based implementations. The GraphLab abstraction consists of three main parts: the data graph, the update function, and the sync operation. The data graph represents user modifiable program state and stores both the mutable user-defined data and encodes the sparse computational dependencies. The update function represents the user computation and operates on the data graph by transforming data in small overlapping contexts called scopes. The sync operation concurrently maintains global aggregates. The framework supports asynchronous iterative computation, dynamic computation, and serializability. It provides two substantially different approaches to implementing the new distributed execution model: the Chromatic Engine, which uses graph coloring to achieve efficient sequentially consistent execution for static schedules, and the Locking Engine, which uses pipelined distributed locking and latency hiding to support dynamically prioritized execution. Fault tolerance is achieved through two snapshotting schemes. The framework is evaluated on three state-of-the-art MLDM applications: collaborative filtering for Netflix movie recommendations, Video Co-segmentation (CoSeg), and Named Entity Recognition (NER). The results show that GraphLab outperforms Hadoop by 20-60x and matches the performance of tailored MPI implementations. 
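The update-function model above can be illustrated with a minimal sketch. The class and function names below (`Scope`, `pagerank_update`) are hypothetical stand-ins, not the actual GraphLab API; the serial loop stands in for the distributed engines and the returned vertex set models dynamic rescheduling:

```python
# Hypothetical sketch of a GraphLab-style update function; names are
# illustrative, not the real GraphLab API. An update function may only
# read and write its scope: the vertex, adjacent edges, and neighbors.

class Scope:
    """The small overlapping context one update function may touch."""
    def __init__(self, graph, ranks, vertex):
        self.vertex = vertex
        self.graph = graph
        self.ranks = ranks
        # In-neighbors: vertices with an edge pointing at `vertex`.
        self.in_neighbors = [u for u, outs in graph.items() if vertex in outs]

def pagerank_update(scope, damping=0.85):
    """PageRank as a GraphLab-style update: recompute one vertex's rank
    from its in-neighbors, and return the set of vertices to reschedule
    when the change is significant (dynamic computation)."""
    total = sum(scope.ranks[u] / len(scope.graph[u]) for u in scope.in_neighbors)
    new_rank = (1 - damping) + damping * total
    delta = abs(new_rank - scope.ranks[scope.vertex])
    scope.ranks[scope.vertex] = new_rank
    # Reschedule out-neighbors only if this vertex changed noticeably.
    return scope.graph[scope.vertex] if delta > 1e-3 else set()

# Tiny example graph: vertex -> set of out-neighbors.
graph = {0: {1, 2}, 1: {2}, 2: {0}}
ranks = {v: 1.0 for v in graph}

# A simple serial scheduler loop standing in for the distributed engines.
pending = set(graph)
while pending:
    v = pending.pop()
    pending |= pagerank_update(Scope(graph, ranks, v))
```

The key point of the model is visible here: computation is expressed per-vertex against a scope, and the schedule evolves dynamically as updates trigger further updates, rather than running in fixed bulk-synchronous rounds.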
The framework's performance scaling improves with higher computation-to-communication ratios, and the GraphLab abstraction more compactly expresses the Netflix, NER, and CoSeg algorithms than MapReduce or MPI.
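The idea behind the Chromatic Engine can be sketched briefly. This is an illustrative toy, not GraphLab's implementation: greedily color the graph so that no two adjacent vertices share a color, then execute each color class as one parallel phase, since same-colored vertices have non-adjacent (hence conflict-free) scopes:

```python
# Illustrative sketch of the Chromatic Engine's core idea (not actual
# GraphLab code): no two adjacent vertices share a color, so all
# vertices of one color can be updated simultaneously without their
# scopes conflicting on shared edges.

def greedy_color(adjacency):
    """Assign each vertex the smallest color unused by its neighbors."""
    colors = {}
    for v in adjacency:
        used = {colors[u] for u in adjacency[v] if u in colors}
        colors[v] = next(c for c in range(len(adjacency)) if c not in used)
    return colors

# Undirected toy graph: vertex -> set of neighbors (symmetric).
adjacency = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
colors = greedy_color(adjacency)

# Group vertices into color classes; each class is one parallel phase,
# and phases run one after another (a static schedule).
phases = {}
for v, c in colors.items():
    phases.setdefault(c, []).append(v)
```

Running the phases in sequence, with all vertices in a phase updated in parallel, yields a sequentially consistent execution without per-vertex locking; the trade-off, as the summary notes, is that this only supports static schedules, which is why the Locking Engine exists for dynamically prioritized execution.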