MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

16 Jan 2024 | Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan
MultiPLY is a multisensory embodied large language model (LLM) that integrates visual, audio, tactile, and thermal information to perform tasks in a 3D environment. It is trained on the Multisensory Universe dataset, a collection of 500,000 interactions generated by an LLM-powered agent exploring 3D environments, covering tasks such as object retrieval, tool use, multisensory captioning, and task decomposition.

MultiPLY encodes the 3D scene as an abstracted object-centric representation. Action tokens denote interactions with objects, while state tokens encode the sensory observations those interactions return. During inference, the model emits an action token to interact with the environment, receives sensory feedback encoded as state tokens, and conditions on that feedback to generate subsequent text or action tokens.

The architecture builds on a vision-language model backbone augmented with modules for tactile, sound, and temperature inputs. Evaluated on object retrieval, tool use, multisensory captioning, and task decomposition, MultiPLY outperforms baselines, demonstrating its ability to reason about and interact with the 3D world. The paper also reviews related work in multisensory learning and multimodal LLMs, emphasizing that active interaction with the environment is important for effective reasoning. Its contributions are the Multisensory Universe dataset and the MultiPLY framework for multisensory embodied reasoning in 3D environments.
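The interleaved generation-interaction loop described above can be illustrated with a minimal Python sketch. This is an assumption-based illustration, not the authors' implementation: the action-token names, the `llm.generate_next_token` and `env.execute` calls, and the per-modality `sensor_encoders` are hypothetical placeholders standing in for MultiPLY's actual tokenizer, simulator interface, and sensor modules.

```python
# Minimal sketch of MultiPLY-style interleaved inference.
# Token names, the environment interface, and the encoders are illustrative
# assumptions, not the paper's actual API.

ACTION_TOKENS = {"<SELECT>", "<NAVIGATE>", "<TOUCH>", "<HIT>",
                 "<PICK-UP>", "<PUT-DOWN>", "<LOOK-AROUND>"}

def embodied_inference(llm, env, sensor_encoders, prompt, max_steps=32):
    """Generate tokens until an action token appears, execute that action in
    the 3D environment, append the resulting sensory feedback as state tokens,
    and continue generating conditioned on the feedback."""
    context = prompt
    for _ in range(max_steps):
        token = llm.generate_next_token(context)   # hypothetical LLM call
        context += token
        if token in ACTION_TOKENS:
            # Execute the interaction and collect the raw multisensory
            # observation (e.g. tactile reading, impact sound, temperature).
            observation = env.execute(token)       # hypothetical env call
            # Encode each modality into state tokens appended to the context,
            # so the next generation step is conditioned on the feedback.
            for modality, reading in observation.items():
                context += sensor_encoders[modality](reading)
        elif token == "<EOS>":
            break
    return context
```

The key design point this sketch captures is that sensory feedback is injected back into the token stream itself, so a single autoregressive model can alternate between reasoning in text and acting in the environment without a separate planning module.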