Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations

19 Aug 2024 | Koffivi Fidèle Gbagbe, Miguel Altamirano Cabrera, Ali Alabbas, Oussama Alyunes, Artem Lykov, and Dzmitry Tsetserukou
The Bi-VLA (Vision-Language-Action) model is a novel system designed for bimanual robotic dexterous manipulation, integrating vision, language comprehension, and physical action generation. The system aims to seamlessly translate human instructions into executable code and execute precise bimanual actions. The research evaluates the system's functionality through household tasks, such as preparing a desired salad, demonstrating its ability to interpret complex human instructions, perceive visual context, and execute precise actions. The system achieved a 100% success rate in generating correct executable code, a 96.06% success rate in detecting specific ingredients, and an overall 83.4% success rate in executing user-requested tasks. The study highlights the importance of advancements in visual understanding to enhance task completion accuracy and suggests future efforts to improve the robustness and versatility of the vision module.
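The pipeline described above, where a language module turns a human instruction plus the vision module's detections into executable robot code, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the hard-coded label set, and the `left_arm`/`right_arm` primitive calls are all assumptions introduced here.

```python
# Hypothetical sketch of a Bi-VLA-style pipeline.
# All names (detect_ingredients, generate_code, left_arm/right_arm
# primitives) are illustrative assumptions, not from the paper.

def detect_ingredients(scene):
    """Stand-in for the vision module: return the labels it recognizes
    among the objects present in the scene."""
    known_labels = {"tomato", "cucumber", "lettuce"}
    return [item for item in scene if item in known_labels]

def generate_code(instruction, detected_objects):
    """Stand-in for the language module: map the instruction and the
    detected objects to a sequence of primitive bimanual robot calls."""
    plan = []
    for obj in detected_objects:
        if obj in instruction:  # only manipulate ingredients the user asked for
            plan.append(f"left_arm.grasp('{obj}')")
            plan.append(f"right_arm.cut('{obj}')")
    return plan

scene = ["tomato", "cucumber", "spoon"]
instruction = "prepare a salad with tomato and cucumber"
plan = generate_code(instruction, detect_ingredients(scene))
# plan now holds grasp/cut calls for tomato and cucumber only
```

In the actual system, `generate_code` would be a vision-language model producing executable code directly, and the primitives would drive two physical arms; the sketch only shows how the three modules (vision, language, action) compose.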