GraphLab: A New Framework For Parallel Machine Learning


Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph Hellerstein
GraphLab is a parallel framework for machine learning (ML) that addresses the challenges of designing and implementing efficient, provably correct parallel ML algorithms. Existing high-level abstractions such as MapReduce are insufficiently expressive for many ML workloads, while low-level tools such as MPI and Pthreads force ML experts to repeatedly solve the same design challenges. GraphLab improves on both by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving high parallel performance. It targets patterns common in ML, notably sparse data dependencies and asynchronous iterative computation, and provides a high-level data representation that insulates users from the complexities of synchronization, data races, and deadlocks, letting ML experts design and implement efficient, scalable parallel algorithms by composing problem-specific computation, data dependencies, and scheduling.

The framework combines a graph-based data model that simultaneously represents data and computational dependencies, a set of concurrent access models that provide a range of sequential-consistency guarantees, a modular scheduling mechanism, and an aggregation framework for managing global state. Using an efficient shared-memory implementation, the authors build parallel versions of several popular ML algorithms: belief propagation, Gibbs sampling, Co-EM, Lasso, and compressed sensing.

GraphLab's data model consists of a directed data graph and a shared data table. The data graph encodes both the problem-specific sparse computational structure and the directly modifiable program state; the shared data table is an associative map between keys and arbitrary blocks of data. Update functions, the core element of computation, operate on the data associated with small neighborhoods in the graph, as the sketch below illustrates.
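To make the data-graph and update-function abstraction concrete, here is a minimal C++ sketch of a PageRank-style computation. All names (vertex_data, iscope, icallback, pagerank_update) are hypothetical, modeled on the concepts described in the paper rather than taken from the actual GraphLab API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical vertex and edge state for a PageRank-style computation.
struct vertex_data { double value = 1.0; };
struct edge_data   { double weight = 1.0; };  // assumed pre-normalized

// A scope exposes a vertex plus its adjacent edges and neighbors -- the
// small neighborhood an update function may read and modify.
struct iscope {
  virtual vertex_data& vertex() = 0;
  virtual const std::vector<std::size_t>& in_neighbors() const = 0;
  virtual const std::vector<std::size_t>& out_neighbors() const = 0;
  virtual const vertex_data& neighbor(std::size_t id) const = 0;
  virtual const edge_data& in_edge(std::size_t from) const = 0;
  virtual ~iscope() = default;
};

// A callback through which the update function schedules further work.
struct icallback {
  virtual void add_task(std::size_t vertex_id) = 0;
  virtual ~icallback() = default;
};

// The core element of computation: read the scope, recompute the vertex
// value, and reschedule downstream vertices whose values may now be stale.
void pagerank_update(iscope& scope, icallback& scheduler) {
  double sum = 0.0;
  for (std::size_t nbr : scope.in_neighbors())
    sum += scope.in_edge(nbr).weight * scope.neighbor(nbr).value;

  vertex_data& v = scope.vertex();
  const double old_value = v.value;
  v.value = 0.15 + 0.85 * sum;

  // Asynchronous, dependency-driven scheduling: only reschedule
  // dependents if this vertex changed significantly.
  if (std::abs(v.value - old_value) > 1e-5)
    for (std::size_t nbr : scope.out_neighbors())
      scheduler.add_task(nbr);
}
```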
The sync mechanism aggregates data across all vertices in the graph, in a manner analogous to the Fold and Reduce operations in functional programming; a sketch of this pattern appears after the summary.

GraphLab provides three data consistency models: full consistency, edge consistency, and vertex consistency. These models let users trade parallel performance against the strength of the consistency guarantee.

The framework also supplies a rich collection of parallel schedulers, including a synchronous scheduler for Jacobi-style algorithms and a round-robin scheduler for Gauss-Seidel-style algorithms, along with a scheduler construction framework, the set scheduler, which enables users to safely and easily compose custom update schedules. A configuration sketch showing how consistency and scheduling are composed closes this summary.

GraphLab has been evaluated on large real-world problems, including retinal image denoising, Gibbs sampling, Co-EM, and Lasso. The results demonstrate that GraphLab achieves excellent parallel performance on large-scale problems.

GraphLab also has the potential to serve as an interface between the ML and systems communities: parallel ML algorithms built around the GraphLab API automatically benefit from developments in parallel data structures; new locking protocols and parallel scheduling primitives incorporated into the API become immediately available to the ML community; and systems experts can port ML algorithms to new parallel hardware simply by porting the GraphLab API.
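The sync mechanism mentioned above can be pictured as a Fold over all vertex data followed by an Apply that writes the aggregate into the shared data table. The sketch below reuses the hypothetical vertex_data type from the previous example and simplifies the shared-data-table entry to a single double; the real GraphLab interface differs in detail.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical accumulator for a global quantity: here, the total and
// count of all vertex values, from which we derive their mean.
struct accumulator { double sum = 0.0; std::size_t count = 0; };

// Fold: called once per vertex, combining vertex data into the accumulator.
// (vertex_data is the type from the previous sketch.)
void sum_fold(const vertex_data& v, accumulator& acc) {
  acc.sum += v.value;
  acc.count += 1;
}

// Apply: called once at the end, writing the aggregate into the
// shared-data-table entry.
void mean_apply(double& sdt_entry, const accumulator& acc) {
  sdt_entry = acc.count ? acc.sum / acc.count : 0.0;
}

// One sync pass, of the kind the engine might run periodically in the
// background: fold over every vertex, then apply the result.
double run_sync(const std::vector<vertex_data>& vertices) {
  accumulator acc;
  for (const vertex_data& v : vertices) sum_fold(v, acc);
  double mean = 0.0;
  mean_apply(mean, acc);
  return mean;
}
```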
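Finally, the consistency models and schedulers described in the summary amount to independent knobs on the execution engine. The enums and engine_options type below are invented for illustration, mirroring the concepts in the text rather than GraphLab's actual configuration interface.

```cpp
#include <cstddef>

// The three consistency models, strongest to weakest:
// full   -- exclusive access to the vertex, its edges, and its neighbors;
// edge   -- exclusive vertex/edge access, read-only neighbor access;
// vertex -- exclusive access to the vertex alone.
enum class consistency_model { full, edge, vertex };

// A few of the schedulers described in the summary.
enum class scheduler_type {
  synchronous,  // Jacobi-style: all vertices updated from the old state
  round_robin,  // Gauss-Seidel-style: fixed order, always-fresh values
  fifo          // dynamic: tasks run in the order they were added
};

// Hypothetical engine configuration: the user composes computation
// (the update function), consistency, and scheduling independently.
struct engine_options {
  consistency_model consistency = consistency_model::edge;
  scheduler_type    scheduler   = scheduler_type::fifo;
  std::size_t       num_threads = 8;
};

int main() {
  // Edge consistency is often a good trade-off: it serializes updates
  // that share an edge while letting distant vertices run in parallel.
  engine_options opts;
  opts.consistency = consistency_model::edge;
  opts.scheduler   = scheduler_type::round_robin;
  // engine.run(graph, pagerank_update, opts);  // hypothetical call
  return 0;
}
```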