More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

February 13, 2024 | Kaiwen Wang*, Owen Oertell, Alekh Agarwal, Nathan Kallus, and Wen Sun
This paper shows that distributional reinforcement learning (DistRL), which learns the full return distribution rather than only its expectation, achieves second-order bounds in both online and offline reinforcement learning (RL) with function approximation. Second-order bounds scale with the variance of a policy's cumulative cost rather than with the minimum expected cumulative cost, and are therefore tighter than the previously known small-loss (first-order) bounds; the improvement is established in particular for low-rank MDPs and for offline RL under single-policy coverage. In the contextual bandit setting, a distributional-learning-based optimism algorithm attains both second-order worst-case regret and gap-dependent bounds.

The analysis is relatively simple: it follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Empirical results on real-world datasets demonstrate the effectiveness of DistRL, and the paper closes with a discussion of computational efficiency and possible further improvements. Together, these results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings.
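To make the comparison concrete, here is a schematic of the two bound types (an illustration under standard normalization assumptions, with cumulative costs in $[0,1]$ over $K$ episodes; it is not the paper's exact theorem statement):

\[
\underbrace{\mathrm{Regret}(K) \;\lesssim\; \sqrt{V^{\star} K}}_{\text{small-loss (first-order)}}
\qquad \text{vs.} \qquad
\underbrace{\mathrm{Regret}(K) \;\lesssim\; \sqrt{\sum_{k=1}^{K} \operatorname{Var}\!\big(Z^{\pi_k}\big)}}_{\text{second-order}},
\]

where $Z^{\pi_k}$ is the random cumulative cost of the policy played in episode $k$, $V^{\star}$ is the minimum expected cumulative cost, and log factors and lower-order terms are suppressed. Since $\operatorname{Var}(Z) \le \mathbb{E}[Z]$ for $Z \in [0,1]$, a second-order bound recovers a small-loss bound via a self-bounding argument, and it can be much smaller, for example in nearly deterministic environments where returns have low variance even when the optimal cost is not small.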
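For intuition about distributional-learning-based optimism in the bandit case, the following is a minimal Python sketch. It is not the paper's algorithm: it substitutes an empirical-Bernstein bonus computed from each arm's observed return distribution for the paper's optimistic distributional regression, and all class and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


class DistributionalUCB:
    """Minimal sketch (not the paper's algorithm): keep the empirical return
    distribution of each arm and act optimistically with a variance-aware
    (empirical-Bernstein-style) bonus, so exploration adapts to observed noise."""

    def __init__(self, n_arms, delta=0.05):
        self.n_arms = n_arms
        self.delta = delta
        self.costs = [[] for _ in range(n_arms)]  # observed costs in [0, 1]

    def select_arm(self, t):
        lcbs = []
        for arm in range(self.n_arms):
            obs = self.costs[arm]
            if not obs:
                return arm  # play every arm once before using bonuses
            n = len(obs)
            mean, var = float(np.mean(obs)), float(np.var(obs))
            log_term = np.log(max(t, 2) / self.delta)
            # Bernstein-style bonus (illustrative constants): shrinks with the
            # variance estimated from the arm's empirical return distribution.
            bonus = np.sqrt(2.0 * var * log_term / n) + 3.0 * log_term / n
            lcbs.append(mean - bonus)  # optimism = lower confidence bound on cost
        return int(np.argmin(lcbs))

    def update(self, arm, cost):
        self.costs[arm].append(cost)


# Usage: a nearly deterministic two-arm problem. Because the learned variances
# are tiny, the bonuses shrink quickly and play concentrates on the cheaper arm,
# which is the behavior a second-order (variance-scaled) bound captures.
true_means, true_stds = [0.2, 0.5], [0.02, 0.02]
agent = DistributionalUCB(n_arms=2)
for t in range(1, 2001):
    arm = agent.select_arm(t)
    cost = float(np.clip(rng.normal(true_means[arm], true_stds[arm]), 0.0, 1.0))
    agent.update(arm, cost)
print("pulls per arm:", [len(c) for c in agent.costs])
```

The point of the sketch is only that the exploration bonus shrinks with the learned variance, which is exactly the quantity a second-order bound scales with.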