22 Feb 2024 | Xiaogang Jia *,†‡ Denis Blessing† Xinkai Jiang†‡ Moritz Reuss† Atalay Donat† Rudolf Lioutikov† Gerhard Neumann†
The paper introduces the Datasets with Diverse Demonstrations for Imitation Learning (D3IL), a benchmark designed to evaluate models' ability to learn multi-modal behaviors from human demonstrations. The D3IL benchmark includes five tasks (Avoiding, Aligning, Pushing, Sorting-X, and Stacking-X) that involve complex object manipulation and require closed-loop sensory feedback. Each task is designed to capture the diversity of human behavior, with multiple viable approaches to task completion and variable trajectory lengths. The paper also introduces metrics to quantify the diversity of learned behaviors, such as behavior entropy and conditional behavior entropy, which provide insights into a model's ability to replicate diverse actions.
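The exact definitions of these metrics are given in the paper; as a rough sketch of the idea, behavior entropy can be estimated from rollout outcomes as the normalized entropy of the empirical distribution over discrete behavior labels (which of the known solutions a rollout realized), and conditional behavior entropy as its average over initial conditions. The function names, the discrete labeling, and the grouping by initial condition below are illustrative assumptions, not the paper's API:

```python
import numpy as np
from collections import Counter

def behavior_entropy(behavior_labels, num_behaviors):
    """Normalized entropy of the empirical distribution over discrete behavior
    labels; returns a value in [0, 1], where 1 means all behaviors occur
    equally often and 0 means the model collapses onto a single behavior."""
    if num_behaviors < 2:
        return 0.0
    counts = Counter(behavior_labels)
    probs = np.array([counts.get(b, 0) for b in range(num_behaviors)], dtype=float)
    probs = probs / probs.sum()
    nonzero = probs[probs > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    return entropy / np.log(num_behaviors)  # normalize by the maximum entropy

def conditional_behavior_entropy(initial_conditions, behavior_labels, num_behaviors):
    """Behavior entropy computed per (discretized) initial condition and then
    averaged, weighted by how often each condition occurs in the rollouts."""
    groups = {}
    for cond, label in zip(initial_conditions, behavior_labels):
        groups.setdefault(cond, []).append(label)
    total = len(behavior_labels)
    return sum(
        len(labels) / total * behavior_entropy(labels, num_behaviors)
        for labels in groups.values()
    )
```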
The authors conduct a comprehensive evaluation of state-of-the-art imitation learning algorithms on the D3IL tasks, categorizing them by whether they use observation history, whether they predict single actions or sequences of future actions, and how they model behavior diversity. The evaluation shows that transformer-based models, particularly those that incorporate historical inputs and prediction horizons, outperform MLP-based models in learning diverse behaviors. The study also highlights the need for careful design when using transformer encoder-decoder structures to handle history and prediction horizons, and the effectiveness of diffusion-based methods on complex tasks with limited data.
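To make the history/prediction-horizon distinction concrete, the hedged sketch below shows a generic closed-loop evaluation loop in which the policy conditions on a short window of past observations and predicts a chunk of future actions, only a prefix of which is executed before re-planning. The gym-style `env`, the `policy.predict` call, and all horizon values are assumptions for illustration, not the benchmark's actual interface:

```python
import numpy as np
from collections import deque

def rollout(env, policy, history_len=5, pred_horizon=8, exec_horizon=4, max_steps=300):
    """Receding-horizon evaluation loop: the policy sees a window of past
    observations and predicts pred_horizon future actions, of which only the
    first exec_horizon are executed before re-planning (closed-loop control).
    `policy.predict` is a placeholder for whichever model is being evaluated
    (MLP, transformer, or diffusion-based)."""
    obs = env.reset()
    history = deque([obs] * history_len, maxlen=history_len)
    info = {}
    for _ in range(0, max_steps, exec_horizon):
        # Predict a sequence of actions from the stacked observation history.
        action_seq = policy.predict(np.stack(history))  # shape: (pred_horizon, act_dim)
        # Execute only a prefix of the predicted sequence, then re-plan.
        for action in action_seq[:exec_horizon]:
            obs, reward, done, info = env.step(action)
            history.append(obs)
            if done:
                return info
    return info
```

Setting history_len to 1 and both horizons to 1 recovers the single-step, history-free regime in which, according to the evaluation, the MLP-based baselines struggle most to reproduce diverse behaviors.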
The paper concludes by emphasizing the value of the D3IL benchmark for guiding future research in imitation learning, particularly in developing algorithms that can effectively learn and generalize from diverse human behaviors.