Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control


25 May 2024 | Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, Marek Cygan
The paper introduces BRO (Bigger, Regularized, Optimistic), a model-free reinforcement learning algorithm that achieves state-of-the-art performance in continuous control. BRO combines larger critic networks, strong regularization, and optimistic exploration to improve both sample efficiency and final performance. Across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks, it outperforms the leading model-based and model-free methods: it is the first model-free algorithm to reach near-optimal policies on the challenging Dog and Humanoid tasks, and it is roughly 2.5 times more sample-efficient than the leading model-based algorithm, TD-MPC2.

The key innovation is the use of strong regularization to enable effective scaling of the critic network, which, combined with optimistic exploration, yields superior performance. BRO can scale along two axes, the number of critic parameters and the number of gradient steps per environment step (the replay ratio), with parameter scaling delivering larger gains while remaining more computationally efficient in parallelized setups. To support this, BRO uses a regularized BroNet architecture for the critic together with optimistic exploration strategies, alongside other domain-specific RL enhancements.

Extensive empirical analysis shows that regularized critic scaling outperforms replay ratio scaling in both performance and computational efficiency, and that domain-specific RL improvements can largely substitute for critic scaling, leading to simpler algorithms. The implementation, built on the JaxRL framework, is publicly available for further research. The paper also discusses limitations of current benchmarks and advocates for standardized benchmarks that reflect the sample efficiency of modern algorithms. The authors conclude that scaling a regularized critic in conjunction with existing algorithmic enhancements yields sample-efficient methods for continuous-action RL, with BRO achieving markedly superior performance within 1 million environment steps compared to the state-of-the-art model-based TD-MPC2 and other model-free baselines.
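To make the critic-scaling idea concrete, below is a minimal Flax sketch of a regularized, scalable Q-network, assuming a BroNet-style design built from dense layers, Layer Normalization, and residual blocks. The module names, widths, and block count are illustrative assumptions, not the authors' exact architecture.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ResidualBlock(nn.Module):
    """Residual MLP block regularized with Layer Normalization (illustrative)."""
    hidden_dim: int

    @nn.compact
    def __call__(self, x):
        residual = x
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        return x + residual


class RegularizedCritic(nn.Module):
    """Q(s, a) critic whose capacity scales via width and number of blocks."""
    hidden_dim: int = 512   # illustrative width
    num_blocks: int = 2     # illustrative depth

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.LayerNorm()(x)
        x = nn.relu(x)
        for _ in range(self.num_blocks):
            x = ResidualBlock(self.hidden_dim)(x)
        return nn.Dense(1)(x)


# Usage: initialize and evaluate the critic on dummy observation/action shapes.
key = jax.random.PRNGKey(0)
obs, act = jnp.ones((1, 17)), jnp.ones((1, 6))
critic = RegularizedCritic()
params = critic.init(key, obs, act)
q_value = critic.apply(params, obs, act)
```

Increasing `hidden_dim` or `num_blocks` is the "bigger" axis of scaling, while the normalization layers play the role of the strong regularization that the paper identifies as the enabler of that scaling.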
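The optimistic-exploration component can likewise be illustrated with a small sketch. Assuming the critic produces a set of quantile Q-estimates, an exploration policy can maximize an upper-confidence-style value rather than a pessimistic lower bound; the function name, the quantile representation, and the `beta` coefficient below are hypothetical illustrations of the general idea, not the paper's exact objective.

```python
import jax.numpy as jnp


def optimistic_value(q_quantiles: jnp.ndarray, beta: float = 0.5) -> jnp.ndarray:
    """Upper-confidence-style value estimate over quantile Q-predictions.

    q_quantiles: array of shape (batch, num_quantiles) with the critic's
        quantile estimates of Q(s, a).
    beta: hypothetical optimism coefficient controlling the exploration bonus.
    """
    q_mean = q_quantiles.mean(axis=-1)
    q_spread = q_quantiles.std(axis=-1)
    # An optimistic behavior policy would maximize this quantity when
    # selecting actions for data collection, instead of a pessimistic bound.
    return q_mean + beta * q_spread
```

Under this reading, the optimistic estimate would only guide data collection, which is one common way to decouple exploration from evaluation; the paper's precise formulation should be consulted for the exact mechanism.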