Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control


25 May 2024 | Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
The paper introduces BRO (Bigger, Regularized, Optimistic), a model-free reinforcement learning algorithm that achieves state-of-the-art performance in continuous control. BRO combines larger critic networks, strong regularization, and optimistic exploration to improve both sample efficiency and final performance. Across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks, it outperforms the leading model-based and model-free methods: it is the first model-free algorithm to reach near-optimal policies on the challenging Dog and Humanoid tasks, and it is roughly 2.5 times more sample-efficient than the leading model-based algorithm, TD-MPC2.

The key innovation is the use of strong regularization to enable effective scaling of the critic network, which, combined with optimistic exploration, yields superior performance. BRO can scale along two axes, the number of critic parameters and the number of gradient steps per environment step (the replay ratio), with parameter scaling delivering larger gains while remaining more computationally efficient in parallelized setups. To support this, BRO uses a regularized BroNet architecture for the critic together with optimistic exploration strategies, alongside other domain-specific RL enhancements.

Extensive empirical analysis shows that regularized critic scaling outperforms replay ratio scaling in both performance and computational efficiency, and that domain-specific RL improvements can largely substitute for critic scaling, leading to simpler algorithms. The implementation, built on the JaxRL framework, is publicly available for further research. The paper also discusses limitations of current benchmarks and advocates for standardized benchmarks that reflect the sample efficiency of modern algorithms. The authors conclude that scaling a regularized critic in conjunction with existing algorithmic enhancements yields sample-efficient methods for continuous-action RL, with BRO achieving markedly superior performance within 1 million environment steps compared to the state-of-the-art model-based TD-MPC2 and other model-free baselines.
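To make the critic-scaling idea concrete, below is a minimal Flax sketch of a regularized, scalable Q-network, assuming a BroNet-style design built from dense layers, Layer Normalization, and residual blocks. The module names, widths, and block count are illustrative assumptions, not the authors' exact architecture.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ResidualBlock(nn.Module):
    """Residual MLP block regularized with Layer Normalization (illustrative)."""
    hidden_dim: int

    @nn.compact
    def __call__(self, x):
        residual = x
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        return x + residual


class RegularizedCritic(nn.Module):
    """Q(s, a) critic whose capacity scales via width and number of blocks."""
    hidden_dim: int = 512   # illustrative width
    num_blocks: int = 2     # illustrative depth

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        for _ in range(self.num_blocks):
            x = ResidualBlock(self.hidden_dim)(x)
        return nn.Dense(1)(x)


# Usage: initialize and evaluate the critic on dummy observation/action shapes.
key = jax.random.PRNGKey(0)
obs, act = jnp.ones((1, 17)), jnp.ones((1, 6))
critic = RegularizedCritic()
params = critic.init(key, obs, act)
q_value = critic.apply(params, obs, act)
```

Increasing `hidden_dim` or `num_blocks` is the "bigger" axis of scaling, while the normalization layers play the role of the strong regularization that the paper identifies as the enabler of that scaling.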
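The optimistic-exploration component can likewise be illustrated with a small sketch. Assuming the critic produces a set of quantile Q-estimates, an exploration policy can maximize an upper-confidence-style value rather than a pessimistic lower bound; the function name, the quantile representation, and the `beta` coefficient below are hypothetical illustrations of the general idea, not the paper's exact objective.

```python
import jax.numpy as jnp


def optimistic_value(q_quantiles: jnp.ndarray, beta: float = 0.5) -> jnp.ndarray:
    """Upper-confidence-style value estimate over quantile Q-predictions.

    q_quantiles: array of shape (batch, num_quantiles) with the critic's
        quantile estimates of Q(s, a).
    beta: hypothetical optimism coefficient controlling the exploration bonus.
    """
    q_mean = q_quantiles.mean(axis=-1)
    q_spread = q_quantiles.std(axis=-1)
    # An optimistic behavior policy would maximize this quantity when
    # selecting actions for data collection, instead of a pessimistic bound.
    return q_mean + beta * q_spread
```

Under this reading, the optimistic estimate would only guide data collection, which is one common way to decouple exploration from evaluation; the paper's precise formulation should be consulted for the exact mechanism.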