26 May 2015 | Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar
MLlib is Spark's open-source distributed machine learning library. It provides efficient functionality for a wide range of learning settings, including classification, regression, collaborative filtering, clustering, and dimensionality reduction. MLlib includes several underlying statistical, optimization, and linear algebra primitives. It supports multiple languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth due to its vibrant open-source community of over 140 contributors and includes extensive documentation to support further growth and help users quickly get up to speed.
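To give a feel for this API, the following is a minimal sketch in Scala (with a SparkContext sc assumed in scope and an illustrative input path) that clusters a dataset with MLlib's k-means implementation:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into MLlib vectors.
val points = sc.textFile("data/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster into k = 2 groups, running at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)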
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. MLlib is designed for large-scale learning settings that benefit from data-parallelism or model-parallelism, and it provides fast, scalable implementations of standard learning algorithms. MLlib is written in Scala, uses native (C++-based) linear algebra libraries on each node, and exposes Java, Scala, and Python APIs. It is released as part of the Spark project under the Apache 2.0 license.
MLlib's tight integration with Spark results in several benefits. First, since Spark is designed with iterative computation in mind, it enables the development of efficient implementations of large-scale machine learning algorithms. Improvements in low-level components of Spark often translate into performance gains in MLlib. Second, Spark's vibrant open-source community has led to rapid growth and adoption of MLlib. Third, MLlib is one of several high-level libraries built on top of Spark, and provides developers with a wide range of tools to simplify the development of machine learning pipelines.
MLlib includes a variety of underlying statistical, linear algebra, and optimization primitives: low-level building blocks and basic utilities for convex optimization, distributed linear algebra, statistical analysis, and feature extraction. It supports various I/O formats, including native support for the LIBSVM format, data integration via Spark SQL, and model export via PMML and MLlib's internal format.
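As a sketch of these I/O paths (file and model locations are placeholders, and a SparkContext sc is assumed), loading LIBSVM data, saving a fitted model in MLlib's internal format, and exporting it as PMML might look like the following:

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

// Native LIBSVM support: each line is "label index1:value1 index2:value2 ...".
val examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Train a linear SVM; the iteration count here is arbitrary.
val model = SVMWithSGD.train(examples, 100)

// Persist in MLlib's internal format and export as PMML for external scoring engines.
model.save(sc, "models/svm")
val pmml: String = model.toPMML()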
MLlib includes many optimizations to support efficient distributed learning and prediction. The ALS algorithm for recommendation makes careful use of blocking to reduce JVM garbage collection overhead and to utilize higher-level linear algebra operations. Decision trees use many ideas from the PLANET project, such as data-dependent feature discretization to reduce communication costs, and tree ensembles parallelize learning both within trees and across trees. Generalized linear models are learned via optimization algorithms which parallelize gradient computation, using fast C++-based linear algebra libraries for worker computations. Many algorithms benefit from efficient communication primitives; in particular tree-structured aggregation prevents the driver from being a bottleneck, and Spark broadcast quickly distributes large models to workers.
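To make the communication pattern concrete, here is a simplified sketch, not MLlib's actual implementation, of summing a logistic-regression gradient with Spark's treeAggregate while shipping the current weights to workers via broadcast (labels are assumed to be 0 or 1):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

def gradientSum(sc: SparkContext, points: RDD[LabeledPoint], weights: Array[Double]): Array[Double] = {
  val dim = weights.length
  val wBc = sc.broadcast(weights)  // ship the model to each node once
  points.treeAggregate(new Array[Double](dim))(
    seqOp = (grad, p) => {
      // margin = w . x, then logistic-loss gradient contribution (sigmoid(margin) - y) * x
      val w = wBc.value
      var margin = 0.0
      var i = 0
      while (i < dim) { margin += w(i) * p.features(i); i += 1 }
      val multiplier = 1.0 / (1.0 + math.exp(-margin)) - p.label
      i = 0
      while (i < dim) { grad(i) += multiplier * p.features(i); i += 1 }
      grad
    },
    combOp = (g1, g2) => {
      var i = 0
      while (i < dim) { g1(i) += g2(i); i += 1 }
      g1
    },
    depth = 2)  // partial sums are merged in a small tree so the driver is not a bottleneck
}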
MLlib includes a Pipeline API that simplifies the development and tuning of multi-stage learning pipelines by providing a uniform set of high-level APIs. Tight integration with Spark also allows MLlib to benefit from the various components of the Spark ecosystem.
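A minimal sketch of such a pipeline (assuming DataFrames trainingDF and testDF with a text column and, for training, a label column) chains tokenization, feature hashing, and logistic regression:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Three stages: raw text -> tokens -> hashed term-frequency features -> classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order and returns a single PipelineModel;
// transform() applies the whole fitted pipeline to new data.
val model = pipeline.fit(trainingDF)
val predictions = model.transform(testDF)

Because the fitted pipeline is a single model, tools such as cross-validation can tune the parameters of all stages together.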