Ray: A Distributed Framework for Emerging AI Applications


30 Sep 2018 | Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
Ray is a distributed framework designed to meet the demands of emerging AI applications, particularly those involving reinforcement learning (RL). It provides a unified interface for both task-parallel and actor-based computations, backed by a single dynamic execution engine. Ray meets its performance and flexibility requirements with a distributed scheduler and a fault-tolerant store for system control state, supporting fine-grained, heterogeneous computations, dynamic execution, and millions of tasks per second at millisecond-level latencies.

Ray unifies training, simulation, and serving for RL applications with a computation model that supports both stateless tasks and stateful actors. Its dynamic task graph computation model enables efficient load balancing and fault tolerance.

Architecturally, Ray consists of an application layer implementing the API and a system layer providing scalability and fault tolerance. The system layer comprises a global control store (GCS), a distributed scheduler, and a distributed object store. The GCS maintains the system's control state and provides fault tolerance through sharding and replication. The distributed scheduler uses a bottom-up approach to dynamically schedule tasks, minimizing latency. The distributed object store holds task inputs and outputs, enabling efficient data sharing and reducing task execution time.

Microbenchmarks demonstrate that Ray scales to high throughput with low latency, and it outperforms existing systems in scalability and fault tolerance, particularly for RL applications. Ray's flexibility allows it to integrate with existing simulators and deep learning frameworks, making it suitable for a wide range of AI applications.
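The task/actor duality described above can be sketched with only the Python standard library. This is an illustrative toy, not Ray's actual API: `concurrent.futures` stands in for Ray's scheduler, and the `square` function and `Counter` class are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor

# Stateless task: any invocation may run on any worker, and the
# result comes back as a future (analogous to Ray's remote tasks).
pool = ThreadPoolExecutor(max_workers=4)

def square(x):
    return x * x

futures = [pool.submit(square, i) for i in range(5)]
results = [f.result() for f in futures]  # blocking get, like ray.get

# Stateful actor: all method calls are serialized onto a single
# dedicated worker thread, so mutations to self.count stay consistent.
class Counter:
    def __init__(self):
        self.count = 0
        self._worker = ThreadPoolExecutor(max_workers=1)

    def increment(self):
        return self._worker.submit(self._increment)  # returns a future

    def _increment(self):
        self.count += 1
        return self.count

counter = Counter()
ticks = [counter.increment().result() for _ in range(3)]
```

In Ray itself, both halves share one interface: a `@ray.remote` decorator turns a function into a task and a class into an actor, which is what the paper means by a unified interface over a single execution engine.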
The framework supports both distributed training and serving, with efficient handling of large-scale simulations and policy updates. Ray's dynamic task graph execution engine enables transparent fault tolerance and efficient recovery from task and actor failures.
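The recovery behavior above rests on lineage-based re-execution: each object records the task that produced it, so a lost result can be recomputed transparently. A toy model of that idea (the `Lineage` class and the `fail` method simulating object loss are illustrative, not Ray internals):

```python
# Toy lineage-based fault tolerance: results are cached in an object
# store, and a "lost" object is transparently recomputed by replaying
# the task recorded in its lineage.
class Lineage:
    def __init__(self):
        self.tasks = {}  # object id -> (function, ids of argument objects)
        self.store = {}  # object id -> computed value

    def submit(self, oid, fn, *arg_ids):
        self.tasks[oid] = (fn, arg_ids)  # record lineage, compute lazily
        return oid

    def get(self, oid):
        if oid not in self.store:  # missing: recompute from lineage
            fn, arg_ids = self.tasks[oid]
            args = [self.get(a) for a in arg_ids]  # recurse on dependencies
            self.store[oid] = fn(*args)
        return self.store[oid]

    def fail(self, oid):
        self.store.pop(oid, None)  # simulate losing a node's objects

g = Lineage()
g.submit("a", lambda: 2)
g.submit("b", lambda x: x + 3, "a")
first = g.get("b")      # computes "a", then "b"
g.fail("a")
g.fail("b")
recovered = g.get("b")  # both objects recomputed from lineage
```

In Ray, the lineage lives in the global control store rather than in one process, which is why recovery survives the failure of individual workers and schedulers.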