[slides] Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

This paper introduces RFST (Robotics with Fast and Slow Thinking), a language-conditioned robotic manipulation framework modeled on the dual-process theory of human cognition. RFST consists of two systems: a fast-thinking system for simple tasks and a slow-thinking system for complex tasks that require reasoning and intent recognition. The fast-thinking system uses a simple policy network, while the slow-thinking system employs a fine-tuned vision-language model (VLM). The framework routes each task according to the type of the user's instruction and uses a Think Bank to store intermediate problem-solving steps. For complex tasks, the VLM generates a step-by-step plan, which is then fed into a policy network that executes the robotic actions. The framework also incorporates CLIP to align visual inputs with text descriptions and fine-tunes the model on a limited dataset. The paper presents a dataset of real-world trajectories for nine tasks, three for the fast-thinking system and six for the slow-thinking system. Evaluation tasks include mathematical reasoning, word correction, sorting cubes by color, and intent recognition. The results show that RFST outperforms existing methods in both simulation and real-world scenarios, particularly on complex tasks, with high success rates in reasoning and intent recognition.
The paper concludes that RFST provides a unified framework for robotic manipulation that can handle both simple and complex tasks, drawing inspiration from human cognitive processes.
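The routing described in the summary (classify the instruction, then either act directly or plan first and store the steps) can be sketched roughly as follows. This is an illustration only, not the authors' implementation: every name here (`ThinkBank`, `classify_instruction`, `run_rfst_sketch`, the stub planner and policy) is hypothetical, and the keyword heuristic merely stands in for whatever instruction classifier the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ThinkBank:
    """Hypothetical store for intermediate plan steps."""
    steps: List[str] = field(default_factory=list)

    def store(self, new_steps: List[str]) -> None:
        self.steps.extend(new_steps)


# Assumption: a keyword heuristic as a placeholder for the real classifier.
SLOW_HINTS = ("sort", "correct", "solve")


def classify_instruction(instruction: str) -> str:
    """Return "slow" if the instruction looks like it needs reasoning."""
    text = instruction.lower()
    return "slow" if any(hint in text for hint in SLOW_HINTS) else "fast"


def run_rfst_sketch(
    instruction: str,
    planner: Callable[[str], List[str]],  # stands in for the fine-tuned VLM
    policy: Callable[[str], str],         # stands in for the policy network
    bank: ThinkBank,
) -> List[str]:
    """Route an instruction through the fast or the slow path."""
    if classify_instruction(instruction) == "fast":
        # Fast path: the policy acts on the instruction directly.
        return [policy(instruction)]
    # Slow path: plan step by step, record the steps, then execute each one.
    plan = planner(instruction)
    bank.store(plan)
    return [policy(step) for step in plan]


def stub_planner(instruction: str) -> List[str]:
    # Fixed plan used only for demonstration.
    return ["pick up the red cube", "place it in the red bin"]


def stub_policy(step: str) -> str:
    # A real policy would emit robot actions; here we only echo the step.
    return f"execute: {step}"
```

Calling `run_rfst_sketch("Sort the cubes by color", stub_planner, stub_policy, ThinkBank())` takes the slow path and executes two planned steps, while an instruction such as "Pick up the cube" goes straight to the policy.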

Language-Conditioned Robotic Manipulation with Fast and Slow Thinking

1 Feb 2024 | Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, and Jian Tang