A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

16 Mar 2011 | Stéphane Ross, Geoffrey J. Gordon, J. Andrew Bagnell
This paper addresses sequential prediction problems, such as imitation learning, in which future observations depend on previous predictions, violating the i.i.d. assumption common in statistical learning. It introduces an iterative algorithm, DAGGER (Dataset Aggregation), that trains a stationary deterministic policy, improving on previous approaches that train non-stationary or stochastic policies and can require a large number of iterations. At each iteration, DAGGER executes the current policy, queries the expert for the correct action in each state visited, aggregates these labeled states with all previously collected data, and trains the next policy on the aggregate; a sketch of this loop is given below. Because this procedure amounts to running a no-regret online learning algorithm, in particular *Follow-The-Leader* over the aggregated datasets, the learned policy is guaranteed to perform well under its own induced distribution of states. More generally, the paper shows that any no-regret algorithm, combined with additional reduction assumptions, yields a policy with good performance in such sequential settings, and its finite-sample analysis bounds the true loss under the learned policy's state distribution. Experiments on two challenging imitation learning problems (steering in Super Tux Kart and playing Super Mario Bros.) and a benchmark handwriting-recognition sequence labeling problem show that DAGGER outperforms previous approaches. The paper concludes with future directions, including more sophisticated online learning strategies for structured prediction and leveraging cost-to-go estimates in reinforcement learning.
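To make the loop concrete, here is a minimal Python sketch of DAGGER under assumed interfaces: `env`, `expert_policy`, and `fit_classifier` are hypothetical stand-ins rather than the authors' code, and the mixing schedule β_i (the probability of executing the expert instead of the learned policy) follows the paper's simple suggestion of β_1 = 1 and β_i = 0 thereafter.

```python
import numpy as np

def dagger(env, expert_policy, fit_classifier, n_iterations=10,
           trajectories_per_iter=5, horizon=100, beta_schedule=None):
    """Minimal DAGGER sketch. `env.reset() -> state`, `env.step(a) -> state`,
    `expert_policy(state) -> action`, and `fit_classifier(states, actions)
    -> policy` are assumed, hypothetical interfaces."""
    if beta_schedule is None:
        # Paper's simple choice: execute the expert only on the first iteration.
        beta_schedule = [1.0] + [0.0] * (n_iterations - 1)

    states, actions = [], []   # the aggregated dataset D
    policies = []
    policy = expert_policy     # pi_1 may be arbitrary; start from the expert

    for i in range(n_iterations):
        beta = beta_schedule[i]
        for _ in range(trajectories_per_iter):
            s = env.reset()
            for _ in range(horizon):
                # Mixture policy pi_i: expert with prob. beta, else learned policy.
                a = expert_policy(s) if np.random.rand() < beta else policy(s)
                # Always record the expert's label for the visited state.
                states.append(s)
                actions.append(expert_policy(s))
                s = env.step(a)
        # Follow-The-Leader step: train on all data aggregated so far.
        policy = fit_classifier(states, actions)
        policies.append(policy)

    # The analysis guarantees a good policy exists in the sequence; the theory
    # selects the best on validation data, while in practice the last often suffices.
    return policies
```

Informally, the guarantee behind this loop (a paraphrase of the paper's analysis, ignoring the small term contributed by expert mixing) is that if the online learner has average regret $\gamma_N$ after $N$ iterations, then some policy $\hat{\pi}$ among $\pi_{1:N}$ satisfies

$$\mathbb{E}_{s \sim d_{\hat{\pi}}}\big[\ell(s, \hat{\pi})\big] \le \epsilon_N + \gamma_N, \qquad \epsilon_N = \min_{\pi \in \Pi} \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}\big[\ell(s, \pi)\big],$$

so as $\gamma_N \to 0$ the learned policy's loss under its own state distribution approaches that of the best policy in hindsight.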