February 13, 2024 | Kaiwen Wang*¹, Owen Oertell¹, Alekh Agarwal², Nathan Kallus¹, and Wen Sun¹
This paper demonstrates that distributional reinforcement learning (DistRL), which learns the full return distribution rather than only its mean, can achieve second-order bounds in both online and offline reinforcement learning (RL) with function approximation. Second-order bounds are instance-dependent and scale with the variance of the return; the authors prove they are strictly tighter than the previously known first-order (small-loss) bounds, which scale with the minimum possible expected cumulative cost. In contextual bandits, a distributional-learning-based optimism algorithm achieves both a second-order worst-case regret bound and a second-order gap-dependent bound, and empirically outperforms a squared-loss regression baseline on real-world datasets. The analysis follows the general framework of optimism in the face of uncertainty, is comparatively simple, and does not require weighted regression. These findings suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, further reinforcing its benefits.
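To make the distinction between the two bound types concrete, the following schematic contrasts them. The notation (K rounds, costs normalized so returns lie in [0, 1], L⋆ for the minimal expected cumulative cost, σ² for the return variance) is illustrative and not taken verbatim from the paper:

```latex
% Schematic only; notation is illustrative, not verbatim from the paper.
% First-order (small-loss) bound: scales with the minimal expected
% cumulative cost L^\star over K rounds.
\mathrm{Regret}(K) \;\lesssim\; \sqrt{\mathcal{L}^\star \cdot K}
% Second-order bound: scales with the variance of the return
% \sigma^2 = \mathrm{Var}\big(\sum_h c_h\big).
\mathrm{Regret}(K) \;\lesssim\; \sqrt{\sigma^2 \cdot K}
% With returns Z normalized to [0, 1], \mathrm{Var}(Z) \le \mathbb{E}[Z],
% so a second-order bound is never worse (up to constants) and can be far
% smaller, e.g., in near-deterministic instances where \sigma^2 \approx 0.
```

The contextual-bandit result can also be illustrated in code. The sketch below shows a minimal form of distributional optimism for a tabular contextual bandit: fit a categorical model of the reward distribution by counting, then act on the most optimistic mean achievable within an L1 confidence ball around the empirical distribution. The class name, the reward discretization, and the confidence-radius constant are all assumptions chosen to keep the example self-contained; this is not the paper's exact algorithm.

```python
# Illustrative sketch of distributional optimism in a tabular contextual
# bandit. NOT the paper's algorithm: the discretization and the L1
# confidence radius are assumptions made for this example.
import numpy as np

class DistributionalOptimismBandit:
    def __init__(self, n_contexts, n_actions, n_bins=11, delta=0.05):
        self.n_bins = n_bins
        self.delta = delta
        # Reward support discretized to n_bins atoms on [0, 1].
        self.atoms = np.linspace(0.0, 1.0, n_bins)
        # counts[x, a, z]: times reward bin z was observed for (context x, action a).
        self.counts = np.zeros((n_contexts, n_actions, n_bins))

    def _optimistic_mean(self, counts):
        """Max expected reward over an L1 ball around the empirical distribution."""
        n = counts.sum()
        if n == 0:
            return 1.0  # unvisited arms are maximally optimistic
        p_hat = counts / n
        # L1 confidence radius; the constant is illustrative, not from the paper.
        eps = min(2.0, np.sqrt(2.0 * np.log(2.0 / self.delta) / n))
        # Moving up to eps/2 probability mass from the lowest-reward atoms to
        # the highest atom maximizes the mean within the L1 ball.
        p = p_hat.copy()
        budget = eps / 2.0
        for z in range(self.n_bins - 1):  # drain lowest bins first
            take = min(p[z], budget)
            p[z] -= take
            p[-1] += take
            budget -= take
            if budget <= 0:
                break
        return float(p @ self.atoms)

    def act(self, x):
        scores = [self._optimistic_mean(self.counts[x, a])
                  for a in range(self.counts.shape[1])]
        return int(np.argmax(scores))

    def update(self, x, a, reward):
        # Map a reward in [0, 1] to its nearest atom.
        z = int(np.clip(np.round(reward * (self.n_bins - 1)), 0, self.n_bins - 1))
        self.counts[x, a, z] += 1

# Usage on synthetic Bernoulli rewards (two contexts, three actions).
rng = np.random.default_rng(0)
true_means = np.array([[0.2, 0.5, 0.8], [0.7, 0.1, 0.4]])
agent = DistributionalOptimismBandit(n_contexts=2, n_actions=3)
for t in range(5000):
    x = int(rng.integers(2))
    a = agent.act(x)
    r = float(rng.random() < true_means[x, a])
    agent.update(x, a, r)
```

Note the design choice this sketch shares with the paper's theme: optimism is computed from a learned distribution (here, a simple categorical model) rather than from a variance-weighted squared-loss regression, which is why no weighted regression appears anywhere in the procedure.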