6 May 2024 | Arpit Bahety, Priyanka Mandikal, Ben Abbatematteo, Roberto Martín-Martín
ScrewMimic is a framework that enables robots to learn bimanual manipulation behaviors from human video demonstrations and to fine-tune them through interaction. The key insight behind ScrewMimic is that many bimanual manipulation tasks can be modeled as a one-degree-of-freedom (1-DoF) screw joint that constrains the relative motion between the two hands. This screw motion defines a new action space for bimanual manipulation: screw actions. ScrewMimic consists of three main components: a perceptual module that extracts a screw action from a human demonstration, a prediction model that maps 3D point clouds to screw actions, and a self-supervised iterative fine-tuning algorithm that refines the predicted screw actions through interaction. Experiments demonstrate that ScrewMimic can learn complex bimanual behaviors from a single human video demonstration and that it outperforms baselines that interpret demonstrations directly in the original motion space. The framework is evaluated on six challenging bimanual manipulation tasks, showing robust performance and strong generalization to novel object instances and poses.
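To make the screw-action idea concrete, here is a minimal sketch of how a 1-DoF screw action can be converted into the relative SE(3) transform between the two hands, using the standard screw-theory exponential map. The parameterization (a unit axis direction, a point on the axis, a pitch, and a motion magnitude) and the function names are illustrative assumptions for this sketch, not code or notation taken from the paper.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def screw_to_transform(s_hat, q, h, theta):
    """SE(3) transform produced by moving `theta` radians along a screw axis.

    s_hat : unit vector along the screw axis
    q     : any point on the axis
    h     : pitch (translation along the axis per radian of rotation)
    theta : motion magnitude in radians
    """
    s_hat = np.asarray(s_hat, dtype=float)
    q = np.asarray(q, dtype=float)
    w = skew(s_hat)
    # Linear part of the unit twist defined by this screw axis.
    v = -np.cross(s_hat, q) + h * s_hat
    # Rodrigues' formula for the rotation about the axis.
    R = np.eye(3) + np.sin(theta) * w + (1.0 - np.cos(theta)) * (w @ w)
    # Closed-form translation of the se(3) matrix exponential.
    G = (np.eye(3) * theta
         + (1.0 - np.cos(theta)) * w
         + (theta - np.sin(theta)) * (w @ w))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = G @ v
    return T

# Hypothetical example: opening a bottle cap, modeled as a pure rotation
# (pitch h = 0) of the cap hand about a vertical axis through the bottle.
# The axis location and angle here are arbitrary illustrative values.
T_rel = screw_to_transform(s_hat=[0.0, 0.0, 1.0],
                           q=[0.4, 0.0, 0.2],
                           h=0.0,
                           theta=np.pi / 2)
```

In this view, a pure rotation (e.g., turning a cap while the other hand holds the bottle) has zero pitch, while a screwing motion that also advances along the axis has a nonzero pitch; in either case the relative motion between the hands collapses to a single scalar magnitude along one axis, which is what makes screw actions a compact action space to predict and refine.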