THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

The COLOSSEUM is a simulation benchmark for evaluating how well robotic manipulation policies generalize under environmental perturbations. It comprises 20 diverse tasks and 14 perturbation factors, including object color, texture, size, lighting, distractors, and camera pose. By varying these factors systematically, the benchmark shows that success rates drop by 30-50% under individual perturbation factors and by 75% or more when multiple perturbations are applied together. Changing the number of distractors, the target object color, or the lighting conditions degrades model performance the most.

The benchmark supports both 2D and 3D models. The 3D-based models achieve higher task success and are more robust to environmental changes, and the results also highlight the importance of pretraining on real-world data.

A real-world extension covers 4 tasks built around 3D-printed objects. Results in simulation are correlated with results under similar real-world perturbations (R² = 0.614). The COLOSSEUM releases open-source code, along with the resources to 3D-print the objects needed to replicate the perturbations in the real world.

The COLOSSEUM Challenge encourages the development of generalizable behavior cloning models: participants generate training data, train models, and evaluate them across the perturbations. Overall, The COLOSSEUM offers a unified platform for evaluating and comparing robotic manipulation methods, with a focus on robustness and generalization.
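The two kinds of numbers quoted in the summary, a relative drop in success rate and an R² between simulation and real-world results, can be illustrated with a minimal Python sketch. The exact metric definitions are not given here, so this assumes the drop is measured relative to the unperturbed success rate and that R² is the squared Pearson correlation. The function names and the sample values are hypothetical placeholders, not results from the benchmark.

```python
from statistics import mean


def relative_drop(baseline: float, perturbed: float) -> float:
    """Relative decrease in success rate, as a fraction of the baseline.

    Assumption: the drop is normalized by the unperturbed success rate.
    """
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (baseline - perturbed) / baseline


def r_squared(x: list[float], y: list[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences.

    Assumption: this stands in for the reported R²; the benchmark's own
    computation may differ (for example, an adjusted R²).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Placeholder success rates per perturbation factor (not benchmark data).
    sim_success = [0.60, 0.35, 0.50, 0.20]
    real_success = [0.55, 0.30, 0.40, 0.25]

    print("relative drop:", relative_drop(baseline=0.80, perturbed=0.40))
    print("sim-to-real R^2:", r_squared(sim_success, real_success))
```

An R² near 1 would mean that perturbations which hurt a policy in simulation hurt it by a proportional amount in the real world; lower values indicate a looser relationship.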

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

28 May 2024 | Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox