Near-Optimal Reinforcement Learning in Polynomial Time

2002 | MICHAEL KEARNS, SATINDER SINGH
The paper presents new algorithms for reinforcement learning in Markov decision processes (MDPs) and proves that they achieve near-optimal performance in polynomial time. The authors focus on the exploration-exploitation trade-off, in which the agent must balance exploring actions to learn more about the environment against exploiting the actions currently believed to be optimal. The key contributions include:

1. **Polynomial Bounds**: The algorithms require a number of actions and an amount of computation time that are polynomial in the mixing time of the optimal policy (undiscounted case) or the horizon time (discounted case), the number of states, and the maximum payoff.
2. **Exploration-Exploitation Handling**: The algorithms explicitly handle the exploration-exploitation trade-off by maintaining a partial model of the MDP and using it either to find a near-optimal policy or to improve the statistics of unknown or insufficiently visited states.
3. **Simulation Lemma**: A lemma showing that an approximation of the MDP can be used to approximate the return of policies in the induced known-state MDP.
4. **Explore or Exploit Lemma**: A lemma formalizing the intuition that either the optimal policy yields high return while staying in the known states, or it leaves these states within a short time, in which case exploration makes rapid progress.
5. **Off-line Computations**: The algorithms perform off-line computations on the partial model to produce either an exploitation policy or an exploration policy, ensuring that the agent either achieves near-optimal performance or quickly discovers new states (a sketch of this loop is given below).

The main theorem states that the algorithms achieve near-optimal performance in polynomial time, with the running time polynomial in the mixing or horizon time, the number of states, and the maximum payoff. The paper also discusses the assumptions and limitations of the results, including the assumption that the agent can observe the state of the environment, and the treatment of function approximation and partially observable MDPs.
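To make the overall control flow concrete, here is a minimal Python sketch of an explicit explore-or-exploit loop in the spirit of the paper's E^3 ("Explicit Explore or Exploit") algorithm. It is an illustration under assumptions rather than the paper's exact procedure: the environment interface, the `known_threshold` and `value_target` parameters, and the planners `plan_exploit` / `plan_explore` (stand-ins for the off-line computations on the partial model) are all hypothetical.

```python
from collections import defaultdict


def e3_style_loop(env, actions, plan_exploit, plan_explore,
                  num_steps=10_000, known_threshold=100, horizon=50,
                  value_target=1.0):
    """Illustrative explore-or-exploit loop (not the paper's exact E^3 algorithm).

    env          -- environment with reset() -> state and step(a) -> (next_state, payoff)
    actions      -- finite list of actions
    plan_exploit -- off-line planner on the known-state model; returns (policy, value)
    plan_explore -- off-line planner; returns a policy that reaches unknown states quickly
    """
    visits = defaultdict(int)        # visits[(s, a)]          action counts
    trans = defaultdict(int)         # trans[(s, a, s_next)]   empirical transition counts
    payoff_sum = defaultdict(float)  # payoff_sum[(s, a)]      accumulated payoffs

    state = env.reset()
    for _ in range(num_steps):
        # A state counts as "known" once every action there has been tried
        # often enough for the partial model's statistics to be reliable.
        seen = {s for (s, _) in visits}
        known = {s for s in seen
                 if all(visits[(s, a)] >= known_threshold for a in actions)}

        if state in known:
            # Off-line computation on the partial model of the known states.
            exploit_policy, exploit_value = plan_exploit(
                trans, payoff_sum, visits, known, horizon)
            if exploit_value >= value_target:
                # Exploit: the partial model certifies a near-optimal policy.
                action = exploit_policy(state)
            else:
                # Explore: by the Explore-or-Exploit Lemma, if no known-state
                # policy achieves high return, some policy escapes the known
                # states quickly, so plan to reach an unknown state instead.
                explore_policy = plan_explore(trans, visits, known, horizon)
                action = explore_policy(state)
        else:
            # Unknown state: balanced wandering -- take the least-tried action.
            action = min(actions, key=lambda a: visits[(state, a)])

        next_state, payoff = env.step(action)
        visits[(state, action)] += 1
        trans[(state, action, next_state)] += 1
        payoff_sum[(state, action)] += payoff
        state = next_state
```

The point the sketch tries to capture is the division of labor described above: on-line, the agent only decides whether its current state is known and, if not, wanders in a balanced way; all policy computation happens off-line on the partial model of the known states.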