[slides] Evaluating Real-World Robot Manipulation Policies in Simulation

This paper addresses the challenge of evaluating generalist robot manipulation policies in a scalable, reproducible, and reliable manner. Real-world evaluation is costly and inefficient, especially as policies expand their capabilities. The authors propose simulation-based evaluation as a scalable proxy: policies trained on real data are evaluated in purpose-built simulated environments. They introduce SIMPLER, a suite of open-source simulated environments for common real robot manipulation setups, including the Google Robot and WidowX. Through extensive paired sim-and-real evaluations, they demonstrate a strong correlation between policy performance in SIMPLER and in the real world. They also show that SIMPLER evaluations accurately reflect policy behavior modes, such as sensitivity to various distribution shifts. To mitigate control and visual disparities between real and simulated environments, they propose and evaluate system identification, "green-screening," and texture tuning. The results highlight the potential of simulation-based evaluation as a scalable and reliable way to assess generalist robot manipulation policies.
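
The "strong correlation" claim concerns paired measurements: each policy has one success rate in simulation and one in the real world. One common way to quantify such agreement is the Pearson correlation coefficient. The summary above does not say which statistic the authors use, so the sketch below is only an illustration under that assumption. The function name `pearson_correlation` and all numbers are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson_correlation(sim_rates, real_rates):
    """Pearson correlation between paired sim and real success rates.

    Each index is assumed to refer to the same policy evaluated in both
    settings (a hypothetical data layout chosen for this illustration).
    """
    if len(sim_rates) != len(real_rates):
        raise ValueError("need exactly one sim rate per real rate")
    if len(sim_rates) < 2:
        raise ValueError("need at least two policies")

    n = len(sim_rates)
    mean_sim = sum(sim_rates) / n
    mean_real = sum(real_rates) / n

    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((s - mean_sim) * (r - mean_real)
              for s, r in zip(sim_rates, real_rates))
    var_sim = sum((s - mean_sim) ** 2 for s in sim_rates)
    var_real = sum((r - mean_real) ** 2 for r in real_rates)

    if var_sim == 0 or var_real == 0:
        raise ValueError("correlation is undefined when all rates are identical")
    return cov / sqrt(var_sim * var_real)


# Hypothetical placeholder success rates for four policies (NOT from the paper).
sim = [0.10, 0.35, 0.60, 0.80]
real = [0.15, 0.30, 0.55, 0.85]
r = pearson_correlation(sim, real)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means that policies which score higher in simulation also tend to score higher on the real robot. A high coefficient alone does not guarantee that the two evaluations rank policies identically, so a rank-based check may be a useful complement.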

Evaluating Real-World Robot Manipulation Policies in Simulation

9 May 2024 | Xuanlin Li*1, Kyle Hsu*2, Jiayuan Gu*1, Karl Pertsch2 3 †, Oier Mees3 1 †, Homer Rich Walke3, Chuyuan Fu4, Ishikaa Lunawat2, Isabel Sieh2, Sean Kirmani4, Sergey Levine3, Jiajun Wu2, Chelsea Finn2, Hao Su11, Quan Vuong14, Ted Xiao44
