Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics


28 Mar 2024 | Norman Di Palo, Edward Johns
Keypoint Action Tokens (KAT) enable efficient in-context imitation learning in robotics by repurposing large text-pretrained Transformers. The method transforms visual observations and action sequences into sequences of keypoint-based tokens, which a text-pretrained Transformer (e.g., GPT-4 Turbo) then processes to generate new action sequences. KAT requires only 10 demonstrations to learn general behaviors and needs no additional training, allowing immediate deployment. The framework combines Vision Transformers with large text-based Transformers to create a low-level "language" of robot actions, converting sequences of observations and actions into tokens that a text-pretrained Transformer can process. KAT was evaluated on a variety of real-world tasks, including aligning objects, wiping plates, sweeping, espresso preparation, and bottle placement, where it performed comparably to or better than state-of-the-art imitation learning methods such as diffusion policies. The method is robust to visual distractors and generalizes well to novel objects and environments.
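To make the pipeline concrete, the sketch below shows one plausible way to serialize keypoint and action tokens into an in-context prompt for a text-pretrained Transformer. It is not the authors' code: the prompt wording, helper names, token shapes, and the final model call are illustrative assumptions; only the overall scheme (demonstrations as observation/action token pairs, plus a new observation to complete) follows the paper's description.

```python
import numpy as np

def serialize_points(points, decimals=2):
    """Flatten an (N, 3) array of 3D points into a compact numeric token string."""
    flat = np.round(np.asarray(points), decimals).flatten()
    return " ".join(f"{v:.{decimals}f}" for v in flat)

def build_prompt(demos, test_keypoints):
    """Build an in-context prompt: each demonstration maps observation (keypoint)
    tokens to action tokens, and the final line asks the model to complete the
    action tokens for the new observation. Prompt wording is illustrative."""
    lines = ["Given observation tokens, output the corresponding action tokens."]
    for keypoints, actions in demos:
        lines.append(f"Observation: {serialize_points(keypoints)}")
        lines.append(f"Actions: {serialize_points(actions)}")
    lines.append(f"Observation: {serialize_points(test_keypoints)}")
    lines.append("Actions:")
    return "\n".join(lines)

# Toy example: 2 demonstrations, each with 10 keypoints (3D) and 4 waypoints,
# every waypoint encoded as a triplet of 3D points (9 numbers).
rng = np.random.default_rng(0)
demos = [(rng.uniform(-0.5, 0.5, (10, 3)), rng.uniform(-0.5, 0.5, (4, 9)))
         for _ in range(2)]
test_keypoints = rng.uniform(-0.5, 0.5, (10, 3))
prompt = build_prompt(demos, test_keypoints)
print(prompt)

# The prompt would then be sent to a text-pretrained Transformer such as
# GPT-4 Turbo, and the returned numbers parsed back into end-effector
# waypoints for the robot to execute.
```

No gradient updates are involved: the demonstrations live entirely in the prompt, which is what allows immediate deployment after collecting them.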
The key components of KAT are Keypoint Tokens, which represent visual observations as sets of 3D keypoints, and Action Tokens, which represent end-effector poses as triplets of 3D points. These tokens form the input sequences to the Transformer, which then generates action sequences that emulate the expert's behavior. The method's effectiveness is attributed to the Transformer's ability to learn sequence-to-sequence patterns from few examples, even though it was pre-trained on text. KAT's performance was tested with varying numbers of demonstrations and keypoint/action tokens: the optimal number of keypoint tokens is between 10 and 20, and the optimal number of action tokens is around 20. The method is also robust to the choice of vision model and action representation, with DINO-ViTs providing the best results for keypoint extraction. The study highlights the potential of repurposing large language models for robotics tasks, particularly in low-data regimes. While KAT does not scale as well as diffusion policies, it offers a promising alternative for efficient in-context imitation learning. Future work may focus on improving adaptability and dynamic keypoint extraction to further enhance performance.
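As a minimal sketch of the Action Token representation, the code below encodes an end-effector pose as a triplet of 3D points and recovers the pose from them. The specific choice of reference points (gripper origin plus offsets along two gripper axes) and the offset value are assumptions made for illustration; the paper only states that poses are represented as triplets of 3D points.

```python
import numpy as np

def pose_to_triplet(position, rotation, offset=0.05):
    """Encode a pose as three 3D points: the gripper origin plus two points
    offset along the gripper's local x and y axes (an illustrative choice)."""
    position = np.asarray(position, dtype=float)
    rotation = np.asarray(rotation, dtype=float)   # 3x3 rotation matrix
    p0 = position
    p1 = position + offset * rotation[:, 0]        # point along local x axis
    p2 = position + offset * rotation[:, 1]        # point along local y axis
    return np.stack([p0, p1, p2])

def triplet_to_pose(triplet, offset=0.05):
    """Invert the mapping: recover position and rotation from the three points.
    With noisy predictions the axes would additionally need orthonormalization."""
    p0, p1, p2 = np.asarray(triplet, dtype=float)
    x_axis = (p1 - p0) / offset
    y_axis = (p2 - p0) / offset
    z_axis = np.cross(x_axis, y_axis)
    rotation = np.stack([x_axis, y_axis, z_axis], axis=1)
    return p0, rotation

# Round trip on an example pose.
R = np.eye(3)
t = np.array([0.3, -0.1, 0.25])
t_rec, R_rec = triplet_to_pose(pose_to_triplet(t, R))
assert np.allclose(t, t_rec) and np.allclose(R, R_rec)
```

Representing orientation through points rather than angles keeps action tokens in the same numeric space as the keypoint tokens, which is consistent with the paper's idea of a shared low-level token "language" for observations and actions.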