6 Mar 2023 | Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Igor Mordatch, Pete Florence
PaLM-E is an embodied multimodal language model that injects continuous sensor data into a language model to enable grounded reasoning about real-world tasks. It processes multimodal inputs, including images, neural 3D scene representations, and text, to perform tasks such as visual question answering, robotic planning, and language understanding. PaLM-E is trained end-to-end on top of a pre-trained large language model, allowing a single model to handle many embodied tasks across different modalities and robot embodiments.

The model demonstrates positive transfer across domains, achieving state-of-the-art performance on the OK-VQA benchmark without task-specific fine-tuning. PaLM-E-562B, the largest version at 562 billion parameters, is a vision-language generalist capable of zero-shot reasoning, multi-image reasoning, and OCR-free math reasoning. It also performs well on real-world robotic tasks such as planning and control, and generalizes to novel scenarios from only a few examples.

Architecturally, PaLM-E can consume neural scene representations and entity-labeling tokens, which let it refer to individual objects when reasoning over combined visual and linguistic information. Training jointly on diverse tasks spanning multiple robot embodiments and general vision-language data yields significant performance improvements and cross-domain transfer. Finally, PaLM-E can retain its general language capabilities during multimodal training, for instance by freezing the language model and training only the input encoders, underscoring its effectiveness both for embodied reasoning and as a general-purpose vision-language model.
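The core mechanism behind this is that continuous observations (e.g. image features from a vision encoder) are mapped into the same embedding space as word tokens and interleaved with text to form a "multimodal sentence" that the decoder-only LLM consumes. The following is a minimal sketch of that idea, not the authors' implementation; all dimensions, function names, and the linear projection are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of PaLM-E-style multimodal sentences:
# continuous features are projected into the LLM's token-embedding
# space and interleaved with word-token embeddings. Shapes and names
# below are assumptions for the example, not the paper's values.

D_MODEL = 4096    # assumed LLM embedding width
D_VISION = 1024   # assumed vision-encoder feature width

rng = np.random.default_rng(0)
W_proj = rng.normal(scale=0.02, size=(D_VISION, D_MODEL))  # learned projection (stand-in)

def embed_text(token_ids, embedding_table):
    """Look up word-token embeddings (stand-in for the LLM's embedder)."""
    return embedding_table[token_ids]                # (T, D_MODEL)

def project_image(image_features):
    """Map continuous image features into the token-embedding space."""
    return image_features @ W_proj                   # (N_patches, D_MODEL)

def build_multimodal_sequence(prefix_ids, image_features, suffix_ids, embedding_table):
    """Interleave text and projected image embeddings,
    e.g. 'Q: What happened between <img> and <img>?'."""
    parts = [
        embed_text(prefix_ids, embedding_table),
        project_image(image_features),
        embed_text(suffix_ids, embedding_table),
    ]
    return np.concatenate(parts, axis=0)             # (T_total, D_MODEL)

# Toy usage with a tiny fake vocabulary and one "image".
vocab_size = 32000
embedding_table = rng.normal(scale=0.02, size=(vocab_size, D_MODEL))
prefix_ids = np.array([5, 17, 42])                   # e.g. "Given <img> ..."
suffix_ids = np.array([7, 99])                       # e.g. "... what next?"
image_features = rng.normal(size=(256, D_VISION))    # e.g. ViT patch features

sequence = build_multimodal_sequence(prefix_ids, image_features, suffix_ids, embedding_table)
print(sequence.shape)  # (3 + 256 + 2, 4096) -> would be fed to the decoder-only LLM
```

In the paper, this interleaved sequence is fed to the language model, which is trained end-to-end (optionally with the LLM frozen) to produce text that is either a direct answer or a plan executed by downstream low-level robot policies.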