**MOSAIC: A Modular System for Assistive and Interactive Cooking**
**Authors:** Huaxiaoyue Wang, Kushal Kedia, Juntao Ren, Rahma Abdullah, Atiksh Bhardwaj, Angela Chao, Kelly Y Chen, Nathaniel Chin, Prithwish Dan, Xinyi Fan, Gonzalo Gonzalez-Pumariega, Aditya Kompella, Maximus Adrian Pace, Yash Sharma, Xiangwan Sun, Neha Sunkara, Sanjiban Choudhury
**Institution:** Cornell University
**Project Website:** [https://portal-cornell.github.io/MOSAIC/](https://portal-cornell.github.io/MOSAIC/)
**Abstract:**
MOSAIC is a modular architecture for home robots that perform complex collaborative tasks, such as cooking a meal with everyday users. It collaborates closely with humans, interacts via natural language, coordinates multiple robots, and manipulates an open vocabulary of everyday objects. Its central design principle is modularity: large-scale pre-trained models handle general capabilities such as language and image understanding, while streamlined, task-specific modules handle control. Extensive evaluations show that MOSAIC collaborates with humans efficiently, completing 68.3% of 60 end-to-end trials across 6 different recipes, with a subtask completion rate of 91.6%.
**Key Contributions:**
1. **Interactive Task Planner:** Embeds Large Language Models (LLMs) within a behavior tree to reduce planning complexity and error rates (a minimal sketch follows this list).
2. **Visuomotor Skills:** Uses pre-trained vision-language models for object identification and RL-trained policies for action selection.
3. **Human Motion Forecasting:** Develops a method to forecast human motion, enabling safe and fluid collaboration.
4. **Comprehensive Evaluation:** Conducts 60 end-to-end trials, 180 episodes of visuomotor picking, 60 episodes of human motion forecasting, and 46 online user evaluations.
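The summary above describes the planner only at a high level. Below is a minimal sketch of the idea in contribution 1, under the assumption that the LLM is confined to a single leaf of an otherwise ordinary behavior tree; all names here (`Sequence`, `ProposeSubtask`, the injected `llm` and `robot` objects) are hypothetical, not taken from the MOSAIC codebase.

```python
# Minimal sketch: an LLM call wrapped as one leaf inside a fixed behavior
# tree. All names are hypothetical; the actual MOSAIC planner is richer.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Sequence:
    """Ticks children in order; stops at the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class ProposeSubtask:
    """Leaf node: the only place the LLM is consulted. Its completion is
    validated against a fixed skill set, so a malformed answer becomes a
    recoverable FAILURE instead of an invalid robot command."""
    def __init__(self, llm, skills):
        self.llm, self.skills = llm, set(skills)

    def tick(self, blackboard):
        prompt = (
            f"Recipe: {blackboard['recipe']}\n"
            f"Completed: {blackboard['done']}\n"
            f"Available skills: {sorted(self.skills)}\n"
            "Next subtask, as skill(argument):"
        )
        proposal = self.llm(prompt).strip()      # e.g. "pick(onion)"
        if proposal.split("(")[0] not in self.skills:
            return Status.FAILURE                # guard against bad output
        blackboard["next_subtask"] = proposal
        return Status.SUCCESS

class ExecuteSubtask:
    """Leaf node: hands the validated subtask to the robot's skills API."""
    def __init__(self, robot):
        self.robot = robot

    def tick(self, blackboard):
        ok = self.robot.execute(blackboard["next_subtask"])
        if ok:
            blackboard["done"].append(blackboard["next_subtask"])
        return Status.SUCCESS if ok else Status.FAILURE

def one_planning_step(llm, robot, skills):
    # Propose, then execute; the surrounding tree (not the LLM) owns
    # retries, human interruptions, and multi-robot coordination.
    return Sequence([ProposeSubtask(llm, skills), ExecuteSubtask(robot)])
```

The design point this illustrates: the behavior tree, not the LLM, owns control flow. Retries, user interruptions, and coordination between the robots stay in ordinary code, and the LLM's output is checked against a fixed skill set before it can reach a robot, which is how embedding LLMs inside a behavior tree can reduce complexity and error rates.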
**Problem Statement:**
The system focuses on collaborative cooking tasks in a kitchen environment, where a human user interacts with two robots (one mobile and one tabletop) via natural language to prepare meals. Assumptions include access to seed recipes, a kitchen map, full observability, and a skills API.
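The skills API is listed as an assumption but never spelled out here. Purely as an illustration of what such an interface could look like (none of these method names come from the paper), a small typed protocol keeps the planner independent of which robot, mobile or tabletop, executes a skill:

```python
# Illustrative only: one plausible shape for the assumed skills API.
# Method names and signatures are hypothetical, not from MOSAIC.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Pose:
    x: float
    y: float
    z: float

class SkillsAPI(Protocol):
    """Planner code is written against this interface, so the same plan
    can be dispatched to either the mobile or the tabletop robot."""
    def goto(self, location: str) -> bool: ...
    def pick(self, object_name: str) -> bool: ...
    def place(self, object_name: str, target: Pose) -> bool: ...
    def pour(self, source: str, target: str) -> bool: ...
```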
**Approach:**
MOSAIC integrates multiple large-scale pre-trained models to solve collaborative cooking tasks. It consists of three main components:
1. **Interactive Task Planner:** Interprets the user's natural-language requests, plans tasks and subtasks, and coordinates their execution across the robots.
2. **Visuomotor Skills:** Executes subtasks using pre-trained vision-language models for object grounding and RL-trained policies for control (see the first sketch after this list).
3. **Human Motion Forecasting:** Predicts human motion so the robots can work safely and fluidly alongside the user (see the second sketch after this list).
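A minimal sketch of the two-stage visuomotor skill in item 2: an open-vocabulary detector grounds a language query in the camera image, and a separately trained policy turns that grounding into motor commands. OWL-ViT is used below only as a stand-in for "a pre-trained vision-language model" (the paper's actual model choice may differ), and the `policy` object is hypothetical.

```python
# Sketch of the two-stage visuomotor skill: open-vocabulary detection,
# then a learned policy. OWL-ViT is a stand-in; `policy` is hypothetical.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def locate(image: Image.Image, object_name: str):
    """Open-vocabulary detection: no per-object training, just a text query."""
    inputs = processor(text=[[f"a photo of a {object_name}"]],
                       images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    sizes = torch.tensor([image.size[::-1]])     # (height, width)
    result = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=sizes)[0]
    if len(result["scores"]) == 0:
        return None
    best = result["scores"].argmax()
    return result["boxes"][best].tolist()        # [x0, y0, x1, y1] pixels

def pick(image, object_name, policy):
    """Stage 2: the detected box conditions an RL-trained policy
    (hypothetical `policy` object) that outputs low-level actions."""
    box = locate(image, object_name)
    if box is None:
        return False  # let the behavior tree replan or ask the user
    return policy.act(image, box)
```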
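And a sketch of item 3, under the assumption that forecasting is framed as sequence prediction over human joint positions and that the forecast gates the robot's motion; the architecture, joint count, and horizon below are illustrative, not the paper's actual model.

```python
# Illustrative forecaster: map a window of past human poses to future
# poses, then use the forecast to gate the robot's planned motion.
import torch
import torch.nn as nn

class MotionForecaster(nn.Module):
    """Maps H past poses (J joints x 3D) to F future poses."""
    def __init__(self, num_joints=17, horizon=15, hidden=256):
        super().__init__()
        self.num_joints = num_joints
        self.horizon = horizon
        self.encoder = nn.GRU(num_joints * 3, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, horizon * num_joints * 3)

    def forward(self, past):                     # past: (B, H, J, 3)
        B, H, J, _ = past.shape
        _, h = self.encoder(past.reshape(B, H, J * 3))
        out = self.decoder(h[-1])                # (B, F * J * 3)
        return out.reshape(B, self.horizon, J, 3)

def too_close(forecast, robot_waypoints, margin=0.2):
    """Conservative safety check: flag a conflict if any forecast joint
    comes within `margin` meters of any planned robot waypoint."""
    # forecast: (F, J, 3); robot_waypoints: (W, 3)
    d = torch.cdist(forecast.reshape(-1, 3), robot_waypoints)
    return bool((d < margin).any())
```

Given such a forecast, a controller can slow or re-time its trajectory before a conflict occurs rather than stopping at the last moment, which is what makes the collaboration fluid as well as safe.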
**Experiments:**
- **End-to-end Trials:** 60 trials with 2 robots and 1 user, completing 6 different recipes.
- **Visuomotor Picking:** 180 episodes evaluating the open-vocabulary picking skill.
- **Human Motion Forecasting:** 60 episodes evaluating the motion forecasting module.
- **Online User Evaluations:** 46 evaluations of the interactive task planner.