8 Feb 2024 | Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
This paper presents a scalable offline actor-critic reinforcement learning (RL) approach that can be applied to large models, such as transformers, and follows scaling laws similar to those observed in supervised learning. The authors introduce a Perceiver-based actor-critic model (PAC) that makes offline RL compatible with self- and cross-attention modules. They demonstrate that offline actor-critic algorithms can outperform strong supervised behavior cloning (BC) baselines for multi-task training on a large dataset containing both suboptimal and expert behavior across 132 continuous control tasks. PAC can smoothly interpolate between BC and offline RL, can be trained on heterogeneous, multi-modal data of varying quality, and enables a seamless transition into offline and online RL fine-tuning for fast adaptation and mastery of control tasks.
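The smooth interpolation between BC and offline RL typically comes from a KL-regularized policy-improvement objective. The expression below is only the generic schematic form of such an objective, written here for orientation; the exact loss and weighting used in the paper may differ:

```latex
% Schematic KL-regularized policy improvement (generic form, not the paper's exact loss):
% maximize expected value under the learned critic Q while staying close to the
% behavior policy \mu estimated from the dataset \mathcal{D}.
\[
J(\pi) \;=\; \mathbb{E}_{s \sim \mathcal{D}}\Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, Q(s, a) \,\big]
    \;-\; \eta \, \mathrm{KL}\big(\pi(\cdot \mid s) \,\Vert\, \mu(\cdot \mid s)\big)
\Big]
\]
```

With a large regularization weight η the KL term dominates and the optimal policy collapses onto the behavior policy (pure BC); shrinking η lets the critic term dominate, recovering offline RL-style policy improvement.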
The authors analyze the scaling behavior of their model and find that it follows scaling laws similar to those observed in supervised learning. They also show that the model can learn multi-task policies that master many domains simultaneously, including real robotics tasks, from suboptimal demonstrations or self-generated data. The model is trained using a KL-regularized RL objective, which allows a natural transition from BC to RL. The authors also introduce architectural advances that enable training with RL at scale, such as incorporating the action into the Q-function via cross-attention and using Perceiver-style cross-attention to learned latent variables.
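To make the architectural idea concrete, here is a small illustrative sketch of a Perceiver-style critic: observation tokens are compressed into a set of learned latents via cross-attention, and the action is injected into the Q-function through a further cross-attention step rather than by input concatenation. This is a toy rendering of the idea, not the authors' implementation; all module names, dimensions, and the exact wiring are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code) of a Perceiver-style critic:
# learned latents cross-attend to observation tokens, and the action is
# incorporated into the Q-value via an extra cross-attention step.
import torch
import torch.nn as nn


class PerceiverStyleCritic(nn.Module):
    def __init__(self, obs_dim=64, act_dim=8, latent_dim=128, num_latents=32, num_heads=4):
        super().__init__()
        # Learned latent array that queries the observation tokens (Perceiver-style).
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        self.obs_proj = nn.Linear(obs_dim, latent_dim)
        self.act_proj = nn.Linear(act_dim, latent_dim)
        # Cross-attention: latents attend to observation tokens.
        self.obs_cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Self-attention over the latents.
        self.latent_self_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Cross-attention: an action token queries the latents to produce a Q-value.
        self.act_cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.q_head = nn.Linear(latent_dim, 1)

    def forward(self, obs_tokens, action):
        # obs_tokens: (batch, num_obs_tokens, obs_dim), action: (batch, act_dim)
        b = obs_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        obs = self.obs_proj(obs_tokens)
        # Compress the (possibly long, multi-modal) observation sequence into latents.
        latents, _ = self.obs_cross_attn(latents, obs, obs)
        latents, _ = self.latent_self_attn(latents, latents, latents)
        # Inject the action via cross-attention instead of concatenating it to the inputs.
        act_token = self.act_proj(action).unsqueeze(1)          # (batch, 1, latent_dim)
        q_token, _ = self.act_cross_attn(act_token, latents, latents)
        return self.q_head(q_token.squeeze(1))                  # (batch, 1)


if __name__ == "__main__":
    critic = PerceiverStyleCritic()
    obs = torch.randn(2, 100, 64)   # e.g. tokens from images plus proprioception
    act = torch.randn(2, 8)
    print(critic(obs, act).shape)   # torch.Size([2, 1])
```

One appeal of injecting the action this late is that the expensive perception stack can in principle be shared between the policy and the Q-function, with only the lightweight action cross-attention differing; how exactly the paper shares these components is not covered in this summary.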
The experiments show that PAC outperforms BC on a number of continuous control benchmarks, including outperforming Gato on Control Suite tasks and recovering expert performance from heterogeneous data in a real robot benchmark. The authors also demonstrate that the model can be fine-tuned on self-generated data to further improve performance on real-world tasks: offline RL can be applied after pre-training without any model changes, raising the success rate on a real robot task from 70% to 90% using RL and autonomously collected data. The scaling analysis provides insights into the optimal model sizes and training durations for their datasets and indicates that the performance of offline RL scales better with compute than that of pure BC. The system allows a gradual and stable transition between BC and RL, can process data of various modalities simultaneously, and remains efficient enough for the biggest model to control a real robot at 20 Hz.
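For reference, scaling analyses of this kind usually fit a standard parametric form borrowed from supervised-learning studies, modeling the loss as a power law in model size and data. The expression below shows only that generic family (Chinchilla-style); the constants are fitted per dataset and are not the coefficients reported in the paper:

```latex
% Generic supervised-style scaling-law family (illustrative, not the paper's fit):
% N = number of parameters, D = number of training tokens,
% E, A, B, \alpha, \beta = constants fitted to the observed loss.
\[
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
```

Minimizing such a fit under a fixed compute budget yields the compute-optimal trade-off between model size and training duration, which is the kind of guidance the paper's scaling analysis provides for its own datasets.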