MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World


16 Jan 2024 | Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan
**Authors:** Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan
**Institutions:** UMass Amherst, UCLA, MIT-IBM Watson AI Lab

**Abstract:** MultiPLY is a multisensory embodied large language model (LLM) designed to encode object-centric multisensory representations (visual, audio, tactile, and thermal) by deploying an embodied agent in a 3D environment. It excels in tasks such as multisensory captioning, question answering, dialogue, manipulation, navigation, tool use, and task decomposition. The model is trained on the Multisensory Universe dataset, which consists of 500k data points collected by an embodied agent interacting with 3D environments. MultiPLY encodes 3D scenes as abstracted object-centric representations and introduces action tokens to denote agent actions and state tokens to represent the resulting multisensory observations. During inference, MultiPLY generates action tokens, instructs the agent to take actions, and appends the resulting observations back to the LLM for further generation. Experimental results show that MultiPLY outperforms baselines on diverse embodied tasks.

**Contributions:**
- **Multisensory Universe:** A large-scale dataset of 500k multisensory interaction examples collected by an embodied agent.
- **MultiPLY:** A multisensory embodied LLM that encodes multisensory object-centric representations and is built by instruction tuning a pre-trained LLM.
- **Performance:** Significant improvements over baselines in object retrieval, tool use, multisensory captioning, and task decomposition.

**Introduction:** Humans interact with the 3D world through multisensory cues, but current multi-modal LLMs cannot actively interact with objects to collect multisensory information. MultiPLY addresses this by encoding multisensory data and enabling end-to-end instruction tuning. The Multisensory Universe dataset is constructed by adding interactive objects to 3D scenes and collecting sensor data through an embodied agent. The model encodes 3D scenes as object-centric representations and introduces action and state tokens for interaction. Inference involves generating action tokens, instructing the agent to act, and appending the resulting observations back to the LLM.

**Methods:**
- **Object-Centric Scene Representations:** 3D scenes are encoded as abstracted object-centric representations built with concept graphs and CLIP encoders.
- **Action Tokens:** Denote agent actions such as selecting, observing, touching, and hitting objects (see the token sketch below).
- **State Tokens:** Represent the multisensory observations returned by an action, such as object points, impact sounds, tactile readings, and temperature.
**Training and Inference:**
- **Training:** Uses LLaVA as the backbone model and aligns sensor data with the LLM's language feature space.
- **Inference:** Generates action tokens, instructs the agent to take the corresponding actions, and appends the resulting multisensory observations back to the LLM as state tokens for further generation (a minimal loop is sketched below).