Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations

19 Aug 2024 | Koffivi Fidèle Gbagbe, Miguel Altamirano Cabrera, Ali Alabbas, Oussama Alyunes, Artem Lykov, and Dzmitry Tsetserukou
The Bi-VLA system is a novel vision-language-action model designed for bimanual robotic dexterous manipulation. It integrates vision for scene understanding, language comprehension for translating human instructions into executable code, and physical action generation. The system was tested on household tasks, including preparing a desired salad based on human requests. The Language Module achieved a 100% success rate in generating correct executable code, the Vision Module achieved a 96.06% success rate in detecting specific ingredients, and the overall task success rate was 83.4%.

The system uses a large language model (LLM) as a semantic planner to coordinate the actions of the robotic arms. The LLM generates a plan for coordinating the movements of the two arms, each equipped with a different tool. The plan is then translated into API function calls, which are executed by the robots.

The Vision Language Module (VLM) verifies the availability of the required items and provides their 2D pixel coordinates. The VLM also maps image pixel coordinates to 3D world coordinates, enabling the robots to grasp and manipulate objects accurately.

The system was evaluated through experiments involving three types of salad preparation: vegetable, Russian, and fruit. The experiments demonstrated the system's ability to accurately interpret human instructions, perceive the visual context of the scene, and execute the required actions. The results showed that the system can complete these tasks with high accuracy and efficiency, with overall performance depending critically on the language and vision modules.
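The plan-to-execution flow described above, in which the LLM's plan is turned into API function calls for the two arms, could look roughly like the following Python sketch. The function names (pick, cut, place_in_bowl) and the salad routine are illustrative assumptions for clarity, not the paper's actual API.

```python
# Hypothetical sketch of an LLM-generated "executable plan" calling a
# robot-side API. All function names and poses here are illustrative
# assumptions, not the Bi-VLA codebase.

# --- Assumed robot-side API that generated code would call ---
def pick(arm: str, xyz: tuple[float, float, float]) -> None:
    print(f"[{arm}] picking object at {xyz}")

def cut(arm: str, xyz: tuple[float, float, float]) -> None:
    print(f"[{arm}] cutting at {xyz}")

def place_in_bowl(arm: str) -> None:
    print(f"[{arm}] placing item in bowl")

# --- Example plan of the kind the Language Module might emit ---
def prepare_vegetable_salad(poses: dict[str, tuple[float, float, float]]) -> None:
    """Coordinate the two arms: left arm grasps, right arm cuts, then transfer to bowl."""
    for ingredient in ("cucumber", "tomato"):
        xyz = poses[ingredient]        # 3D position supplied by the vision module
        pick("left_arm", xyz)          # left arm grasps the ingredient
        cut("right_arm", xyz)          # right arm slices with its tool
        place_in_bowl("left_arm")      # left arm drops the slices into the bowl

if __name__ == "__main__":
    # 3D positions would normally come from the vision module's 2D-to-3D mapping
    prepare_vegetable_salad({
        "cucumber": (0.42, -0.10, 0.05),
        "tomato":   (0.45,  0.12, 0.05),
    })
```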
Future work will focus on enhancing the robustness and versatility of the vision module to better handle a variety of visual variations and uncertainties.
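The 2D-to-3D mapping performed by the Vision Language Module corresponds to standard pinhole back-projection when depth and camera calibration are available. Below is a minimal sketch under that assumption; the intrinsics, extrinsics, and depth value are placeholders, since the summary does not specify the paper's exact calibration pipeline.

```python
# Minimal sketch of back-projecting a 2D pixel detection to a 3D world point,
# assuming a calibrated RGB-D camera: intrinsics K, camera-to-world transform
# T_world_cam, and a metric depth reading at the pixel. Standard pinhole math,
# not the paper's specific implementation.

import numpy as np

def pixel_to_world(u: float, v: float, depth: float,
                   K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Map pixel (u, v) with depth (meters) to a 3D point in the world frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project through the pinhole model into the camera frame
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    p_cam = np.array([x_cam, y_cam, depth, 1.0])   # homogeneous camera-frame point
    return (T_world_cam @ p_cam)[:3]               # transform into the world frame

# Example with illustrative calibration values
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]])
T_world_cam = np.eye(4)                            # identity: camera frame == world frame
print(pixel_to_world(350.0, 260.0, 0.80, K, T_world_cam))
```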