Deep Recurrent Q-Learning for Partially Observable MDPs

11 Jan 2017 | Matthew Hausknecht and Peter Stone
This paper introduces the Deep Recurrent Q-Network (DRQN), a modification of the Deep Q-Network (DQN) that adds a Long Short-Term Memory (LSTM) layer to cope with partial observability, i.e., with Partially Observable Markov Decision Processes (POMDPs). DRQN replaces DQN's first fully connected layer with an LSTM layer, allowing it to integrate information across time steps rather than relying only on the current frame, and thereby to learn policies when individual observations are incomplete.

DRQN was evaluated on standard Atari games and on their "flickering" equivalents, which are POMDPs in which each game screen is obscured with some probability. In standard Atari games, where the full state is effectively observable, DRQN performs similarly to DQN. In the flickering games, where observations are incomplete, DRQN outperforms DQN, demonstrating its ability to handle partial observability. When trained with full observations and evaluated with partial observations, DRQN's performance degrades less than DQN's, indicating that recurrence is a viable alternative to stacking frames in DQN's input layer. The paper also shows that policies trained with partial observations generalize to complete observations: in the Flickering Pong domain, DRQN's performance scales with the observability of the domain, reaching near-perfect levels when every game screen is observed.

This suggests that the recurrent network learns policies that are both robust to missing information and able to exploit increased observability. While recurrence provides no systematic advantage in learning to play the games, it allows the network to adapt better at evaluation time when the quality of observations changes. The paper concludes that recurrence is a viable way to handle partial observations, though it confers no systematic benefit over stacking observations in the input layer of a convolutional network. Future work could identify the characteristics of specific games that make recurrence effective.
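As a concrete illustration of the architecture change described above, the sketch below shows a DRQN-style network in PyTorch: DQN's convolutional stack is kept, its first fully connected layer is swapped for an LSTM, and a linear head maps the LSTM output to Q-values. The single-frame 84x84 input, the 512-unit hidden size, and the layer hyperparameters here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class DRQN(nn.Module):
    def __init__(self, num_actions: int, hidden_size: int = 512):
        super().__init__()
        # DQN-style convolutional stack over a single 84x84 grayscale frame
        # (sizes assumed here; DRQN processes one frame per time step).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # DQN's first fully connected layer is replaced by an LSTM, which
        # carries a hidden state and so integrates information across frames.
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=hidden_size,
                            batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- a sequence of single frames.
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        out, hidden = self.lstm(feats.reshape(b, t, -1), hidden)
        # Q-values for every action at every time step, plus the LSTM state.
        return self.q_head(out), hidden


# A batch of 4 sequences, each 10 frames long, for an 18-action Atari game.
q_values, _ = DRQN(num_actions=18)(torch.zeros(4, 10, 1, 84, 84))
print(q_values.shape)  # torch.Size([4, 10, 18])
```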
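The flickering games discussed above can be approximated with a simple observation wrapper that blanks out each screen with some probability (the paper obscures each frame with probability 0.5). The sketch below uses the gymnasium API; the class name FlickerWrapper and the ALE/Pong-v5 environment id (which requires ale-py) are assumptions for illustration.

```python
import numpy as np
import gymnasium as gym


class FlickerWrapper(gym.ObservationWrapper):
    """Replace each observation with a blank screen with probability obscure_prob."""

    def __init__(self, env: gym.Env, obscure_prob: float = 0.5):
        super().__init__(env)
        self.obscure_prob = obscure_prob

    def observation(self, obs):
        # Obscure the entire screen with the given probability; otherwise pass it through.
        if np.random.rand() < self.obscure_prob:
            return np.zeros_like(obs)
        return obs


# Usage (assumes ale-py is installed): turn Pong into Flickering Pong,
# where roughly half of the frames are visible.
env = FlickerWrapper(gym.make("ALE/Pong-v5"), obscure_prob=0.5)
```

Training a recurrent agent against such a wrapper forces it to rely on its hidden state to compensate for the missing frames, which is the effect the flickering experiments are designed to probe.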