Q-value Regularized Transformer for Offline Reinforcement Learning

2024 | Shengchao Hu, Ziqing Fan, Chaoqin Huang, Li Shen, Ya Zhang, Yanfeng Wang, Dacheng Tao
This paper introduces the Q-value regularized Transformer (QT), a novel approach for offline reinforcement learning (RL) that combines the trajectory modeling ability of the Transformer with the ability of dynamic programming (DP) methods to predict optimal future returns. QT learns an action-value function and adds a term that maximizes action-values to the training loss of Conditional Sequence Modeling (CSM), seeking optimal actions that stay close to the behavior policy. Empirical evaluations on the D4RL benchmark demonstrate that QT outperforms traditional DP and CSM methods, highlighting its potential to advance the state of the art in offline RL.

Offline RL aims to learn effective policies from previously collected data without interacting with the environment. Recent work has shifted focus from policy regularization and value function approximation to a generic CSM task in which past experience is fed to a Transformer. This converts offline RL into a supervised learning problem, allowing the model to handle long sequences and avoid the stability issues associated with bootstrapping. However, CSM struggles to stitch optimal trajectories from suboptimal ones because the returns sampled from the data are inconsistent with optimal returns. DP methods offer a solution by approximating optimal future returns, but they are prone to unstable learning, especially in long-horizon and sparse-reward scenarios.

QT addresses these challenges by integrating a Q-value module into the Transformer policy, enabling the selection of high-reward actions while preserving the original trajectory modeling ability. The training objective has two components: a conditional behavior cloning term that aligns the Transformer's action sampling with the distribution of the training set, and a policy improvement term that selects high-reward actions according to the learned Q-value. This hybrid structure offers effective distribution matching, identification of high-reward actions, and a balance between selecting optimal actions and maintaining fidelity to the behavior policy.
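As a rough illustration of how such a combined objective could look in practice, the following is a minimal PyTorch sketch of a loss with a conditional behavior cloning term plus a Q-maximization term. The names (transformer_policy, critic, alpha) and the Q-normalization trick are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch (not the paper's code): conditional behavior cloning
# plus a Q-value maximization term on the policy's predicted actions.
import torch
import torch.nn.functional as F

def qt_policy_loss(transformer_policy, critic, states, actions, returns_to_go, alpha=1.0):
    """Actor loss for a batch of sub-trajectories.

    states:        (B, T, state_dim)   observed states
    actions:       (B, T, action_dim)  dataset (behavior-policy) actions
    returns_to_go: (B, T, 1)           return-to-go conditioning tokens
    """
    # Conditional behavior cloning: the Transformer predicts dataset actions
    # conditioned on the trajectory context.
    pred_actions = transformer_policy(states, actions, returns_to_go)  # (B, T, action_dim)
    bc_loss = F.mse_loss(pred_actions, actions)

    # Policy improvement: push predicted actions toward high Q-values
    # (only the last timestep is scored here for simplicity).
    q_values = critic(states[:, -1], pred_actions[:, -1])              # (B, 1)
    # Normalizing by the Q magnitude keeps the two terms on a similar scale;
    # the paper's exact weighting may differ.
    q_term = -q_values.mean() / q_values.abs().mean().detach()

    return bc_loss + alpha * q_term
```

The weight alpha controls the trade-off described above: a larger value favors selecting high-Q actions, a smaller value keeps the policy closer to the behavior distribution.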
The results show that QT consistently achieves superior performance across Gym, Adroit, Kitchen, Maze2D, and AntMaze tasks on the D4RL benchmark, outperforming existing methods in trajectory stitching, sparse-reward handling, and long task horizons. The Q-value module enhances the policy by enabling preferential sampling of high-value actions, aligning the learning process more closely with optimal returns and yielding improved performance over the baseline behavior policy.
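For context, the action-value function behind this preferential sampling can be learned with a standard one-step temporal-difference objective. The sketch below is a generic illustration under that assumption (critic, target_critic, and the batch layout are hypothetical names); details such as double Q-learning, target smoothing, or ensembles would follow the paper.

```python
# Illustrative sketch of learning the Q-value module with a one-step TD target.
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, batch, gamma=0.99):
    # target_critic is typically a slowly updated copy of critic.
    states, actions, rewards, next_states, next_actions, dones = batch
    with torch.no_grad():
        # Bootstrapped target; next_actions may come from the dataset or from
        # the current Transformer policy.
        target_q = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)
    return F.mse_loss(critic(states, actions), target_q)

def soft_update(critic, target_critic, tau=0.005):
    # Polyak averaging keeps the bootstrapped targets slowly moving and stable.
    for p, tp in zip(critic.parameters(), target_critic.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```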