4 Apr 2024 | Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, Vikas Sindhwani
This paper presents a modular, zero-shot, safe bi-arm embodied AI system that enables a robot to perform complex tasks from natural language instructions. The system integrates state-of-the-art Large Language Models (LLMs) for task planning, Vision-Language Models (VLMs) for perception, and Point Cloud Transformers for grasping, together with a constrained trajectory optimizer and a compliant tracking controller that keep the robot safe while operating in close proximity to humans. Because the design is modular, individual components are easy to debug and to replace.
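To make the modular structure concrete, here is a minimal sketch of how such a pipeline might be composed. All class names and method signatures below (TaskPlanner, Perception, GraspModel, TrajectoryOptimizer) are hypothetical illustrations of the layout, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Protocol

# All names below are hypothetical illustrations of the modular layout,
# not the paper's actual API.

@dataclass
class SceneObject:
    label: str          # semantic label from the VLM
    point_cloud: list   # 3D points used by the grasp model

class TaskPlanner(Protocol):
    """LLM wrapper: natural-language instruction -> high-level commands."""
    def plan(self, instruction: str, scene: list[SceneObject]) -> list[str]: ...

class Perception(Protocol):
    """VLM wrapper: RGB-D observation -> labeled scene objects."""
    def detect(self, observation: bytes) -> list[SceneObject]: ...

class GraspModel(Protocol):
    """Point Cloud Transformer wrapper: object -> candidate grasp pose."""
    def propose_grasp(self, obj: SceneObject) -> list[float]: ...

class TrajectoryOptimizer(Protocol):
    """Constrained optimizer: grasp target -> safe joint-space trajectory."""
    def solve(self, grasp_pose: list[float]) -> list[list[float]]: ...

def run_pipeline(instruction: str,
                 observation: bytes,
                 planner: TaskPlanner,
                 perception: Perception,
                 grasping: GraspModel,
                 optimizer: TrajectoryOptimizer) -> None:
    """One pass through the modular stack; any module can be swapped out."""
    scene = perception.detect(observation)
    for command in planner.plan(instruction, scene):
        # In the real system a state machine maps commands to bi-arm Skills;
        # here we just pick the referenced object and grasp it.
        target = next((o for o in scene if o.label in command), None)
        if target is None:
            continue  # command does not reference a visible object
        grasp = grasping.propose_grasp(target)
        trajectory = optimizer.solve(grasp)
        # ...hand the trajectory to the compliant tracking controller...
```

Because each stage is hidden behind a narrow interface, any one module can be replaced or augmented (for example, with a learned policy) without touching the rest of the stack.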
The system is tested on three tasks: bi-arm sorting, bottle opening, and trash disposal. These tasks require coordination between both arms and are subject to both semantic and physical safety constraints. The system operates zero-shot, meaning it performs the tasks without any prior training on the specific robot or environment, and it incorporates safety mechanisms such as collision avoidance between the arms and careful handling of objects.
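As a rough illustration of the kind of physical safety constraint a trajectory optimizer can enforce, the sketch below rejects candidate trajectories that bring the two arms too close together. The separation threshold and the point-cloud representation of the arms are invented for the example; the real system solves a constrained trajectory optimization problem rather than filtering trajectories after the fact.

```python
import numpy as np

# Hypothetical illustration of a minimum-separation safety check;
# the threshold below is invented for the example.
MIN_ARM_SEPARATION_M = 0.10

def min_separation(left_points: np.ndarray, right_points: np.ndarray) -> float:
    """Smallest pairwise distance between points sampled on the two arms
    (arrays of shape (N, 3) and (M, 3))."""
    diffs = left_points[:, None, :] - right_points[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).min())

def trajectory_is_safe(left_traj: list[np.ndarray],
                       right_traj: list[np.ndarray]) -> bool:
    """Reject any candidate trajectory whose waypoints bring the arms closer
    than the threshold; an optimizer would treat this as a hard constraint."""
    return all(
        min_separation(left, right) >= MIN_ARM_SEPARATION_M
        for left, right in zip(left_traj, right_traj)
    )
```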
The system's modularity allows different safety modalities to be integrated and lets the system adapt to new tasks by replacing or augmenting modules with learned policies. The LLM generates high-level commands, which a state machine executes using a library of pre-defined bi-arm Skills; each Skill combines perception data with motion planning to generate joint-space trajectories for both arms.
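A minimal sketch of this command-to-Skill dispatch might look as follows. The skill vocabulary and the "verb argument" command format are assumptions made for illustration, not the paper's actual command language.

```python
from enum import Enum, auto

# Hypothetical skill vocabulary and command format, for illustration only.
class Skill(Enum):
    PICK = auto()
    PLACE = auto()
    HANDOVER = auto()   # bi-arm: pass an object between grippers
    OPEN_LID = auto()   # bi-arm: one arm holds, the other twists

SKILL_TABLE = {
    "pick": Skill.PICK,
    "place": Skill.PLACE,
    "handover": Skill.HANDOVER,
    "open": Skill.OPEN_LID,
}

def execute_commands(commands: list[str]) -> None:
    """Toy state machine: step through LLM-generated commands such as
    'pick bottle' and dispatch each to a pre-defined bi-arm Skill."""
    for command in commands:
        verb, _, argument = command.partition(" ")
        skill = SKILL_TABLE.get(verb)
        if skill is None:
            raise ValueError(f"No Skill registered for command: {command!r}")
        # Each Skill would combine perception and motion planning here to
        # produce joint-space trajectories for both arms.
        print(f"Executing {skill.name} on {argument or '<none>'}")

execute_commands(["pick bottle", "handover bottle", "open bottle"])
```

Keeping the LLM at the level of discrete commands, with a fixed Skill library underneath, is what lets the planner be swapped or prompted differently without retraining any low-level controller.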
Evaluated across these tasks, the system shows high success rates and handles complex scenarios, while its modular design makes debugging and incremental improvement straightforward. This combination of zero-shot capability, safety, and modularity makes it a promising approach for real-world embodied AI systems facing complex instructions and safety constraints.