8 Feb 2024 | Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
This paper presents a scalable offline actor-critic reinforcement learning (RL) approach that can be applied to large models, such as transformers, and follows scaling laws similar to those observed in supervised learning. The authors introduce a Perceiver-based actor-critic model (PAC) that makes offline RL compatible with self- and cross-attention modules. They demonstrate that offline actor-critic algorithms can outperform strong supervised behavior cloning (BC) baselines for multi-task training on a large dataset containing both suboptimal and expert behavior across 132 continuous control tasks. PAC can smoothly interpolate between BC and offline RL, can be trained on heterogeneous, multi-modal data of varying quality, and enables a seamless transition into offline and online RL fine-tuning for fast adaptation and mastery of control tasks.
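The smooth interpolation between BC and offline RL typically comes from a KL-regularized policy-improvement objective. The expression below is only the generic schematic form of such an objective, written here for orientation; the exact loss and weighting used in the paper may differ:

```latex
% Schematic KL-regularized policy improvement (generic form, not the paper's exact loss):
% maximize expected value under the learned critic Q while staying close to the
% behavior policy \mu estimated from the dataset \mathcal{D}.
\[
J(\pi) \;=\; \mathbb{E}_{s \sim \mathcal{D}}\Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, Q(s, a) \,\big]
    \;-\; \eta \, \mathrm{KL}\big(\pi(\cdot \mid s) \,\Vert\, \mu(\cdot \mid s)\big)
\Big]
\]
```

With a large regularization weight η the KL term dominates and the optimal policy collapses onto the behavior policy (pure BC); shrinking η lets the critic term dominate, recovering offline RL-style policy improvement.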
The authors analyze the scaling behavior of their model and find that it follows scaling laws similar to those observed in supervised learning. They also show that the model can learn multi-task policies that master many domains simultaneously, including real robotics tasks, from suboptimal demonstrations or self-generated data. The model is trained using a KL-regularized RL objective, which allows a natural transition from BC to RL. The authors also introduce architectural advances that enable training with RL at scale, such as incorporating the action into the Q-function via cross-attention and using Perceiver-style cross-attention to learned latent variables.
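To make the architectural idea concrete, here is a small illustrative sketch of a Perceiver-style critic: observation tokens are compressed into a set of learned latents via cross-attention, and the action is injected into the Q-function through a further cross-attention step rather than by input concatenation. This is a toy rendering of the idea, not the authors' implementation; all module names, dimensions, and the exact wiring are assumptions made for illustration.

```python
# Illustrative sketch (not the authors' code) of a Perceiver-style critic:
# learned latents cross-attend to observation tokens, and the action is
# incorporated into the Q-value via an extra cross-attention step.
import torch
import torch.nn as nn


class PerceiverStyleCritic(nn.Module):
    def __init__(self, obs_dim=64, act_dim=8, latent_dim=128, num_latents=32, num_heads=4):
        super().__init__()
        # Learned latent array that queries the observation tokens (Perceiver-style).
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        self.obs_proj = nn.Linear(obs_dim, latent_dim)
        self.act_proj = nn.Linear(act_dim, latent_dim)
        # Cross-attention: latents attend to observation tokens.
        self.obs_cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Self-attention over the latents.
        self.latent_self_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        # Cross-attention: an action token queries the latents to produce a Q-value.
        self.act_cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.q_head = nn.Linear(latent_dim, 1)

    def forward(self, obs_tokens, action):
        # obs_tokens: (batch, num_obs_tokens, obs_dim), action: (batch, act_dim)
        b = obs_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        obs = self.obs_proj(obs_tokens)
        # Compress the (possibly long, multi-modal) observation sequence into latents.
        latents, _ = self.obs_cross_attn(latents, obs, obs)
        latents, _ = self.latent_self_attn(latents, latents, latents)
        # Inject the action via cross-attention instead of concatenating it to the inputs.
        act_token = self.act_proj(action).unsqueeze(1)          # (batch, 1, latent_dim)
        q_token, _ = self.act_cross_attn(act_token, latents, latents)
        return self.q_head(q_token.squeeze(1))                  # (batch, 1)


if __name__ == "__main__":
    critic = PerceiverStyleCritic()
    obs = torch.randn(2, 100, 64)   # e.g. tokens from images plus proprioception
    act = torch.randn(2, 8)
    print(critic(obs, act).shape)   # torch.Size([2, 1])
```

One appeal of injecting the action this late is that the expensive perception stack can in principle be shared between the policy and the Q-function, with only the lightweight action cross-attention differing; how exactly the paper shares these components is not covered in this summary.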
The experiments show that PAC outperforms BC on a number of continuous control benchmarks, including outperforming Gato on Control Suite tasks and recovering expert performance from heterogeneous data in a real robot benchmark. The authors also demonstrate that the model can be fine-tuned on self-generated data to further improve performance on real-world tasks: offline RL can be applied after pre-training without any model changes, raising the success rate on a real robot task from 70% to 90% using RL and autonomously collected data. The scaling analysis provides insights into the optimal model sizes and training durations for their datasets and indicates that the performance of offline RL scales better with compute than that of pure BC. The system allows a gradual and stable transition between BC and RL, can process data of various modalities simultaneously, and remains efficient enough for the biggest model to control a real robot at 20 Hz.
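For reference, scaling analyses of this kind usually fit a standard parametric form borrowed from supervised-learning studies, modeling the loss as a power law in model size and data. The expression below shows only that generic family (Chinchilla-style); the constants are fitted per dataset and are not the coefficients reported in the paper:

```latex
% Generic supervised-style scaling-law family (illustrative, not the paper's fit):
% N = number of parameters, D = number of training tokens,
% E, A, B, \alpha, \beta = constants fitted to the observed loss.
\[
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
```

Minimizing such a fit under a fixed compute budget yields the compute-optimal trade-off between model size and training duration, which is the kind of guidance the paper's scaling analysis provides for its own datasets.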