PaLM-E: An Embodied Multimodal Language Model

6 Mar 2023 | Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Igor Mordatch, Pete Florence
PaLM-E is a general-purpose multimodal language model designed for embodied reasoning tasks, visual-language tasks, and language tasks. It integrates real-world continuous sensor modalities into language models, bridging the gap between words and percepts. PaLM-E processes multi-modal sentences, in which inputs from various modalities (e.g., images, neural 3D representations, or states) are interleaved with text tokens. The model is trained end-to-end, combining a pre-trained large language model with encoders for continuous inputs, to perform tasks such as sequential robotic manipulation planning, visual question answering, and captioning.

Evaluations show that PaLM-E can address a variety of embodied reasoning tasks from different observation modalities and exhibits positive transfer, benefiting from diverse joint training across internet-scale language, vision, and visual-language domains. The largest model, PaLM-E-562B, achieves state-of-the-art performance on OK-VQA and retains generalist language capabilities with increasing scale. PaLM-E demonstrates capabilities such as zero-shot multimodal chain-of-thought reasoning, few-shot prompting, OCR-free math reasoning, and multi-image reasoning, despite being trained on only single-image examples.

The main contributions include proposing embodied language models, showing that a general-purpose visual-language model can be an efficient embodied reasoner, introducing novel architectural ideas, and demonstrating that scaling the language model size enables multimodal finetuning with less catastrophic forgetting.
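To make the "multi-modal sentence" idea concrete, below is a minimal sketch of how continuous observations can be projected into a language model's token-embedding space and interleaved with text tokens. All names, dimensions, the `-1` placeholder convention, and the numpy stand-ins are our own illustrative assumptions; the actual PaLM-E uses trained vision encoders (e.g., ViT) and the PaLM decoder rather than random matrices.

```python
import numpy as np

# Hypothetical dimensions for illustration only; PaLM-E learns a projection
# from encoder features into the word-embedding space of a decoder-only LLM.
EMBED_DIM = 512        # assumed LLM embedding width
VOCAB_SIZE = 1000      # assumed toy vocabulary size
IMG_FEAT_DIM = 256     # assumed vision-encoder output width
TOKENS_PER_IMAGE = 4   # assumed number of "soft tokens" per image

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))   # stand-in for the LLM's embedding table
projector = rng.normal(size=(IMG_FEAT_DIM, EMBED_DIM))       # learned affine projection in practice


def encode_image(image_features: np.ndarray) -> np.ndarray:
    """Map continuous sensor features to a few vectors in the LLM embedding space."""
    return image_features @ projector  # (TOKENS_PER_IMAGE, EMBED_DIM)


def embed_multimodal_sentence(token_ids, images):
    """Interleave projected image embeddings with text-token embeddings.

    `token_ids` uses -1 as a placeholder where an image should be spliced in
    (the placeholder convention here is ours, not the paper's).
    """
    parts, img_iter = [], iter(images)
    for tid in token_ids:
        if tid == -1:
            parts.append(encode_image(next(img_iter)))
        else:
            parts.append(token_embedding[tid][None, :])
    # The resulting sequence would be fed to the (pre-trained) language model.
    return np.concatenate(parts, axis=0)


# Toy usage: a prompt like "What is in <img> ?" with one image placeholder.
fake_image = rng.normal(size=(TOKENS_PER_IMAGE, IMG_FEAT_DIM))
sequence = embed_multimodal_sentence([12, 7, -1, 45], [fake_image])
print(sequence.shape)  # (3 text tokens + 4 image tokens, EMBED_DIM) -> (7, 512)
```

The key design point the sketch mirrors is that continuous inputs become vectors in the same space as word embeddings, so the language model can attend over percepts and words uniformly and be trained end-to-end on the mixed sequence.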