Ray: A Distributed Framework for Emerging AI Applications


30 Sep 2018 | Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
Ray is a distributed framework designed to meet the demands of emerging AI applications, particularly those involving reinforcement learning (RL). It provides a unified interface for both task-parallel and actor-based computations, backed by a single dynamic execution engine. Ray meets its performance and flexibility requirements with a distributed scheduler and a fault-tolerant store for system control state, supporting fine-grained, heterogeneous computations, dynamic execution, and millions of tasks per second at millisecond-level latencies.

Ray unifies training, simulation, and serving for RL applications with a computation model that supports both stateless tasks and stateful actors. Its dynamic task graph computation model enables efficient load balancing and fault tolerance.

Architecturally, Ray consists of an application layer implementing the API and a system layer providing scalability and fault tolerance. The system layer comprises a global control store (GCS), a distributed scheduler, and a distributed object store. The GCS maintains the system's control state and provides fault tolerance through sharding and replication. The distributed scheduler uses a bottom-up approach to dynamically schedule tasks, minimizing latency. The distributed object store holds task inputs and outputs, enabling efficient data sharing and reducing task execution time.

Microbenchmarks demonstrate that Ray scales to high throughput with low latency, and it outperforms existing systems in scalability and fault tolerance, particularly for RL applications. Ray's flexibility allows it to integrate with existing simulators and deep learning frameworks, making it suitable for a wide range of AI applications.
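The task/actor duality described above can be sketched with only the Python standard library. This is an illustrative toy, not Ray's actual API: `concurrent.futures` stands in for Ray's scheduler, and the `square` function and `Counter` class are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor

# Stateless task: any invocation may run on any worker, and the
# result comes back as a future (analogous to Ray's remote tasks).
pool = ThreadPoolExecutor(max_workers=4)

def square(x):
    return x * x

futures = [pool.submit(square, i) for i in range(5)]
results = [f.result() for f in futures]  # blocking get, like ray.get

# Stateful actor: all method calls are serialized onto a single
# dedicated worker thread, so mutations to self.count stay consistent.
class Counter:
    def __init__(self):
        self.count = 0
        self._worker = ThreadPoolExecutor(max_workers=1)

    def increment(self):
        return self._worker.submit(self._increment)  # returns a future

    def _increment(self):
        self.count += 1
        return self.count

counter = Counter()
ticks = [counter.increment().result() for _ in range(3)]
```

In Ray itself, both halves share one interface: a `@ray.remote` decorator turns a function into a task and a class into an actor, which is what the paper means by a unified interface over a single execution engine.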
The framework supports both distributed training and serving, with efficient handling of large-scale simulations and policy updates. Ray's dynamic task graph execution engine enables transparent fault tolerance and efficient recovery from task and actor failures.
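The recovery behavior above rests on lineage-based re-execution: each object records the task that produced it, so a lost result can be recomputed transparently. A toy model of that idea (the `Lineage` class and the `fail` method simulating object loss are illustrative, not Ray internals):

```python
# Toy lineage-based fault tolerance: results are cached in an object
# store, and a "lost" object is transparently recomputed by replaying
# the task recorded in its lineage.
class Lineage:
    def __init__(self):
        self.tasks = {}  # object id -> (function, ids of argument objects)
        self.store = {}  # object id -> computed value

    def submit(self, oid, fn, *arg_ids):
        self.tasks[oid] = (fn, arg_ids)  # record lineage, compute lazily
        return oid

    def get(self, oid):
        if oid not in self.store:  # missing: recompute from lineage
            fn, arg_ids = self.tasks[oid]
            args = [self.get(a) for a in arg_ids]  # recurse on dependencies
            self.store[oid] = fn(*args)
        return self.store[oid]

    def fail(self, oid):
        self.store.pop(oid, None)  # simulate losing a node's objects

g = Lineage()
g.submit("a", lambda: 2)
g.submit("b", lambda x: x + 3, "a")
first = g.get("b")      # computes "a", then "b"
g.fail("a")
g.fail("b")
recovered = g.get("b")  # both objects recomputed from lineage
```

In Ray, the lineage lives in the global control store rather than in one process, which is why recovery survives the failure of individual workers and schedulers.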