DISCOVERING TEMPORALLY-AWARE REINFORCEMENT LEARNING ALGORITHMS

8 Feb 2024 | Matthew T. Jackson*, Chris Lu*, Louis Kirsch, Robert T. Lange, Shimon Whiteson, Jakob N. Foerster
This paper addresses the challenge of discovering temporally-aware reinforcement learning (RL) algorithms that can adapt to varying training horizons. Recent advances in meta-learning have enabled the automatic discovery of novel RL algorithms parameterized by surrogate objective functions, but existing methods ignore the total number of training steps available to the agent, a factor that strongly shapes both human learning and RL. The authors propose a simple augmentation to two existing meta-learned objective discovery approaches, Learned Policy Gradient (LPG) and Learned Policy Optimization (LPO), that conditions the discovered objective on information about the agent's remaining lifetime. The resulting *temporally-aware* variants, TA-LPG and TA-LPO, can adapt their learning rules dynamically throughout the agent's training process.
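To make the idea concrete, the sketch below shows one way a learned objective could be conditioned on training progress. It is a minimal illustration under assumptions, not the paper's implementation: the objective is modelled as a tiny MLP over per-sample inputs (policy ratio and advantage, as in LPO-style objectives) plus a normalized lifetime fraction, and the names `temporal_objective` and `progress` are illustrative.

```python
import jax
import jax.numpy as jnp

# Minimal sketch (assumed, not the paper's code): a lifetime-conditioned
# objective modelled as a tiny MLP over per-sample inputs (policy ratio and
# advantage) plus a normalised training-progress input, which is the extra
# signal that makes the update rule temporally aware.
def temporal_objective(params, ratio, advantage, step, total_steps):
    progress = jnp.full_like(ratio, step / total_steps)  # fraction of lifetime elapsed
    features = jnp.stack([ratio - 1.0, advantage, progress], axis=-1)
    hidden = jnp.tanh(features @ params["w1"] + params["b1"])
    return (hidden @ params["w2"] + params["b2"]).squeeze(-1)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = {
    "w1": 0.1 * jax.random.normal(k1, (3, 16)), "b1": jnp.zeros(16),
    "w2": 0.1 * jax.random.normal(k2, (16, 1)), "b2": jnp.zeros(1),
}
ratio = jnp.array([1.05, 0.90])       # pi_new(a|s) / pi_old(a|s)
advantage = jnp.array([0.30, -0.20])
# The same batch is scored differently early vs. late in training.
print(temporal_objective(params, ratio, advantage, step=1_000, total_steps=100_000))
print(temporal_objective(params, ratio, advantage, step=90_000, total_steps=100_000))
```

Meta-training then searches over `params` so that agents trained with this objective achieve high return over their entire lifetime.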
The paper compares meta-gradient and evolutionary meta-optimization for discovering non-myopic RL objective functions. Meta-gradient approaches, constrained by memory and backpropagation through time over truncated windows, fail to learn temporally-aware updates; evolution strategies (ES), which optimize over an agent's entire lifetime, discover more dynamic and adaptive learning rules. The authors evaluate their approach on a wide range of tasks, including continuous control and discrete Atari-like environments, and report significant improvements in performance as well as generalization to unseen training horizons and environments. The learned algorithms balance exploration and exploitation by modifying their update rules over the agent's lifetime, and the schedules extracted from them, covering policy importance-ratio clipping, update norms, and entropy annealing, adapt to the training horizon. An analysis of the learned update schedules shows that TA-LPO behaves like LPO early in training but shifts from optimism toward pessimism as training progresses. The authors conclude that the temporally-aware variants of LPG and LPO demonstrate strong generalization and adaptability, underscoring the value of expressive, adaptive learning algorithms in RL, and that evolutionary optimization is a critical ingredient for discovering algorithms capable of effective lifetime conditioning.
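As a rough illustration of why ES can discover lifetime-conditioned rules, the sketch below shows a generic ES outer loop in which each candidate's fitness is the return obtained over a full agent lifetime. All names here (`es_meta_step`, `lifetime_fitness`) are hypothetical, and the toy fitness function is a stand-in for the paper's actual training pipeline.

```python
import jax
import jax.numpy as jnp

# Hypothetical sketch of an evolutionary outer loop. Because fitness is the
# return an agent achieves over its *entire* lifetime, the meta-objective is
# never truncated the way backpropagation-through-time meta-gradients are,
# which is what allows ES to discover update rules conditioned on training
# progress. In the real method, `lifetime_fitness` would train an agent to
# completion with the objective parameterised by `theta`.
def es_meta_step(key, theta, lifetime_fitness, pop_size=16, sigma=0.05, lr=0.01):
    noise = jax.random.normal(key, (pop_size, theta.size))
    fitness = jax.vmap(lambda eps: lifetime_fitness(theta + sigma * eps))(noise)
    # Centre and scale fitness so the update is a baseline-subtracted
    # score-function estimate of the gradient of expected fitness.
    weights = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    grad_estimate = (weights[:, None] * noise).sum(axis=0) / (pop_size * sigma)
    return theta + lr * grad_estimate

# Toy stand-in for lifetime fitness (higher is better), just to show the loop.
toy_fitness = lambda t: -jnp.sum((t - 1.0) ** 2)

theta = jnp.zeros(4)
key = jax.random.PRNGKey(0)
for _ in range(200):
    key, sub = jax.random.split(key)
    theta = es_meta_step(sub, theta, toy_fitness)
print(theta)  # drifts toward the optimum at 1.0
```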