29 Feb 2024 | Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik
The paper presents a novel approach to real-world humanoid locomotion by framing it as a next token prediction problem, similar to language modeling. The authors train a causal transformer model to autoregressively predict sensorimotor trajectories, which are collected from various sources such as neural network policies, model-based controllers, motion capture data, and YouTube videos. The model is trained to predict complete input sequences, including both sensory and motor tokens, allowing it to handle incomplete data and leverage diverse modalities. The trained model is tested on a full-sized humanoid robot, enabling zero-shot walking in real-world environments like San Francisco. The results show that the model can generalize to unseen commands, such as walking backward, and perform well even with limited training data. The approach demonstrates a promising path for learning challenging real-world control tasks through generative modeling of sensorimotor trajectories.
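To make the core idea concrete, here is a minimal sketch (not the authors' code) of next-token prediction over sensorimotor trajectories with a causal transformer. Dimensions, module names, and the validity-mask convention are illustrative assumptions; the sketch only predicts action tokens, whereas the paper models both sensory and motor tokens.

```python
# Minimal sketch: causal transformer over interleaved observation/action tokens.
# All sizes and names (obs_dim, act_dim, act_valid, ...) are hypothetical.
import torch
import torch.nn as nn


class SensorimotorGPT(nn.Module):
    def __init__(self, obs_dim=64, act_dim=20, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        # Project continuous observation and action vectors into a shared token space.
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.act_embed = nn.Linear(act_dim, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Predict the next action token from the current hidden state.
        self.act_head = nn.Linear(d_model, act_dim)

    def forward(self, obs, act):
        # obs: (B, T, obs_dim), act: (B, T, act_dim); interleave as o_1, a_1, o_2, a_2, ...
        B, T, _ = obs.shape
        tokens = torch.stack([self.obs_embed(obs), self.act_embed(act)], dim=2).reshape(B, 2 * T, -1)
        pos = torch.arange(2 * T, device=obs.device)
        x = tokens + self.pos_embed(pos)
        # Causal mask so each token attends only to the past.
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T).to(obs.device)
        h = self.backbone(x, mask=mask)
        # Hidden states at observation positions predict the following action token.
        return self.act_head(h[:, 0::2, :])


def masked_action_loss(pred_act, target_act, act_valid):
    # act_valid (B, T) marks which action tokens are real; trajectories without
    # actions (e.g. motion capture or video-derived data) contribute no action loss.
    err = (pred_act - target_act) ** 2
    return (err * act_valid.unsqueeze(-1)).sum() / act_valid.sum().clamp(min=1)
```

In a training loop over the mixed data sources described above, sequences lacking motor data would simply carry a zeroed validity mask, which is one plausible way to realize the paper's handling of incomplete modalities.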