May 11–16, 2024, Honolulu, HI, USA | Chao Wang, Stephan Hasler, Daniel Tanneberg, Felix Ocker, Frank Joublin, Antonello Ceravola, Joerg Deimgoeller, Michael Gienger
This paper introduces an innovative large language model (LLM)-based robotic system designed to enhance multi-modal human-robot interaction (HRI). Traditional HRI systems often rely on complex, resource-intensive designs for intent estimation, reasoning, and behavior generation. In contrast, the proposed system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomic actions" and expressions, and offering a set of examples. The system is implemented on a physical robot and demonstrates proficiency in adapting to multi-modal inputs, determining appropriate actions, and coordinating the robot's movements with speech output to produce dynamic, multi-modal expressions. This approach shifts HRI from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven framework. The system's architecture comprises three modules: "Scene Narrator," "Planner," and "Expresser," which work together to process multi-modal inputs, plan actions, and control the robot's expressions. Preliminary tests show that the robot can effectively meet researcher expectations, suggesting the potential to revolutionize HRI. The paper also discusses the configuration space for HRI, including high-level guidance, atomic actions, and examples, and highlights the importance of rule-based reactive expressions for improving user interaction.
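To make the described architecture more concrete, the following is a minimal illustrative sketch, in Python, of how the three modules ("Scene Narrator," "Planner," "Expresser") and the configuration space (high-level guidance, atomic actions, examples) might be wired together. Only the three module names and the three configuration aspects come from the abstract; every class, function, field, and prompt format here (e.g. `HRIConfiguration`, `fake_llm`, the `"<action> | <speech>"` reply convention) is an assumption for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical LLM interface: any callable mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class HRIConfiguration:
    """The three configuration aspects named in the abstract (assumed representation)."""
    guidance: str                                                   # high-level linguistic guidance
    atomic_actions: Dict[str, str] = field(default_factory=dict)    # action name -> description
    examples: List[str] = field(default_factory=list)               # few-shot demonstrations


class SceneNarrator:
    """Turns multi-modal observations (speech, detected objects, gestures)
    into a textual scene description for the Planner."""
    def narrate(self, observations: Dict[str, str]) -> str:
        return "; ".join(f"{key}: {value}" for key, value in observations.items())


class Planner:
    """Prompts the LLM with guidance, atomic actions, examples, and the current
    scene narration, and expects one named atomic action plus accompanying speech."""
    def __init__(self, llm: LLM, config: HRIConfiguration):
        self.llm = llm
        self.config = config

    def plan(self, scene: str) -> str:
        prompt = (
            f"Guidance: {self.config.guidance}\n"
            f"Available actions: {', '.join(self.config.atomic_actions)}\n"
            "Examples:\n" + "\n".join(self.config.examples) + "\n"
            f"Scene: {scene}\n"
            "Respond as '<action> | <speech>'."
        )
        return self.llm(prompt)


class Expresser:
    """Coordinates the chosen atomic action with the robot's speech output."""
    def express(self, plan: str) -> None:
        action, _, speech = (part.strip() for part in plan.partition("|"))
        print(f"[robot action] {action}")
        print(f"[robot speech] {speech}")


if __name__ == "__main__":
    # Stub standing in for a real model call.
    def fake_llm(prompt: str) -> str:
        return "hand_over_object | Here is the cup you asked for."

    config = HRIConfiguration(
        guidance="Be a polite assistant that helps with tabletop tasks.",
        atomic_actions={"hand_over_object": "pass an object to the user",
                        "nod": "nod to acknowledge the user"},
        examples=["Scene: user asks for the cup -> hand_over_object | Sure, here you go."],
    )
    scene = SceneNarrator().narrate({"speech": "Could you pass me the cup?",
                                     "objects": "cup, bottle"})
    Expresser().express(Planner(fake_llm, config).plan(scene))
```

The point of the sketch is the division of labor: perception is flattened into text before planning, the LLM is steered only through the configuration object rather than hand-designed state machines, and the Expresser is the single place where an action label is turned into coordinated motion and speech.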