A Survey on Vision-Language-Action Models for Embodied AI

23 May 2024 | Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King, Fellow, IEEE
This survey provides an overview of vision-language-action (VLA) models in embodied AI, highlighting their role in enabling robots to understand and execute complex tasks by integrating the vision, language, and action modalities. VLAs have emerged as a critical component of embodied AI, offering greater versatility, dexterity, and generalizability than traditional reinforcement learning approaches. The survey traces the evolution of unimodal models, the development of vision-language models, and their integration into VLA frameworks. It organizes VLAs around three main components: pretraining, control policy, and task planner, each contributing to the overall effectiveness of robotic systems. It then reviews pretraining methods, including contrastive learning, masked autoencoding, and world model learning, which improve a model's ability to understand and predict environmental dynamics. It also examines low-level control policies of different types, namely non-Transformer, Transformer-based, and large language model (LLM)-based approaches, each with distinct strengths and applications. Finally, the survey addresses open challenges in VLA development, such as data scarcity, robot dexterity, and generalization across tasks and environments, and concludes with future research directions aimed at improving the capabilities of embodied AI systems.
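
As a concrete illustration of one pretraining method named above, the sketch below shows a minimal CLIP-style contrastive objective that pulls matched image-text pairs together in a shared embedding space. It is a generic example under stated assumptions: the class name, dimensions, and random tensors standing in for encoder outputs are illustrative, not the interface of any specific model discussed in the survey.

    # Minimal sketch of CLIP-style contrastive pretraining on paired
    # image/text features (illustrative; not the survey's own code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisionLanguageContrastive(nn.Module):
        def __init__(self, vision_dim=512, text_dim=512, embed_dim=256):
            super().__init__()
            # Project each modality into a shared embedding space.
            self.vision_proj = nn.Linear(vision_dim, embed_dim)
            self.text_proj = nn.Linear(text_dim, embed_dim)
            # Learnable temperature, initialized near log(1/0.07) as in CLIP.
            self.logit_scale = nn.Parameter(torch.tensor(2.659))

        def forward(self, vision_feats, text_feats):
            # Normalize so the dot product is a cosine similarity.
            v = F.normalize(self.vision_proj(vision_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            logits = self.logit_scale.exp() * v @ t.T  # (batch, batch) similarities
            labels = torch.arange(v.size(0), device=v.device)
            # Symmetric InfoNCE loss: matched pairs lie on the diagonal.
            return (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.T, labels)) / 2

    # Usage with random features standing in for frozen encoder outputs.
    model = VisionLanguageContrastive()
    loss = model(torch.randn(8, 512), torch.randn(8, 512))
    loss.backward()

The other pretraining families surveyed swap this objective out: masked autoencoding replaces it with a reconstruction target over masked inputs, while world model learning predicts future observations or states conditioned on actions.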