A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
16 Mar 2011 | Stéphane Ross, Geoffrey J. Gordon, J. Andrew Bagnell
This paper presents a reduction of imitation learning and structured prediction to no-regret online learning. The authors propose a new iterative algorithm, DAGGER (Dataset Aggregation), which trains a stationary deterministic policy that performs well under the distribution of states it induces. Because the approach is reduction-based, any existing supervised learning algorithm can be reused as a sub-routine: DAGGER is simple to implement, has no free parameters other than that sub-routine, and requires a number of iterations that scales nearly linearly with the effective horizon of the problem. A sketch of the training loop is given after this summary.

DAGGER is closely related to no-regret online learning algorithms, but it better leverages the presence of the expert in the imitation learning setting, and the authors demonstrate that any no-regret online learner, used in the same fashion, yields a policy with similar guarantees. The theoretical analysis shows that the expected total cost of the learned policy grows linearly (or near-linearly) in both the task horizon T and the surrogate classification error ε, in contrast to the O(T²ε) growth of the traditional supervised (behavioral cloning) reduction.

In practice, DAGGER outperforms previous approaches on two challenging imitation learning problems, learning to steer a car in the 3D racing game Super Tux Kart and learning to play Super Mario Bros., and it also performs well on a benchmark structured prediction task, handwriting recognition posed as sequence labeling.
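To make the training loop concrete, here is a minimal sketch of DAGGER's dataset-aggregation procedure, written under stated assumptions rather than as a definitive implementation: `expert_action`, `rollout`, and `train_classifier` are hypothetical helpers standing in for the expert, the environment, and the supervised learning sub-routine, and the geometrically decaying mixing schedule for β is just one of the choices discussed in the paper.

```python
import random

def dagger(expert_action, rollout, train_classifier, N=20, T=100, p=0.5):
    """Sketch of DAGGER: collect expert labels on the states the learner
    itself visits, aggregate them, and retrain after each iteration."""
    dataset = []      # aggregated dataset D of (state, expert action) pairs
    policy = None     # learned policy \hat{pi}_i (None before first training)
    policies = []
    for i in range(N):
        beta = p ** i  # mixing rate; beta = 1 on the first iteration

        def mixed_policy(s, _beta=beta, _policy=policy):
            # pi_i: follow the expert with prob. beta, else the learned policy.
            if _policy is None or random.random() < _beta:
                return expert_action(s)
            return _policy(s)

        states = rollout(mixed_policy, T)                   # states visited under pi_i
        dataset += [(s, expert_action(s)) for s in states]  # label them with the expert
        policy = train_classifier(dataset)                  # train \hat{pi}_{i+1} on all of D
        policies.append(policy)
    return policies   # the paper returns the best policy under validation
```

The paper notes that the simplest, parameter-free schedule, β₁ = 1 and βᵢ = 0 for i > 1, often works well in practice, so that after the first iteration the learner's own policy alone drives the state distribution.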
The authors conclude that by batching over iterations of interaction with a system, no-regret methods, including DAGGER, can provide a learning reduction with strong performance guarantees in both imitation learning and structured prediction. Future work includes considering more sophisticated strategies for structured prediction and using base classifiers that rely on inverse optimal control techniques to learn a cost function for a planner to aid prediction in imitation learning.
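For reference, those guarantees have roughly the following shape; this is a paraphrase from memory of the paper's analysis, with ε (or ε_N) denoting the best achievable surrogate classification loss on the relevant state distribution and u a bound on how much a single deviation from the expert can increase the cost-to-go.

```latex
% Paraphrased shape of the bounds (a sketch, not the paper's exact statements).
% DAGGER / no-regret reduction: total cost linear in T and the achievable loss.
J(\hat{\pi}) \le J(\pi^{\ast}) + u T \epsilon_N + O(1)
% Traditional supervised (behavioral cloning) reduction: quadratic in T.
J(\hat{\pi}) \le J(\pi^{\ast}) + T^{2} \epsilon
```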