LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

May 11–16, 2024 | Chao Wang, Stephan Hasler, Daniel Tanneberg, Felix Ocker, Frank Joublin, Antonello Ceravola, Joerg Deigmoeller, Michael Gienger
LaMI is a robotic system that uses large language models (LLMs) to enable multi-modal human-robot interaction (HRI). It allows researchers and practitioners to regulate robot behavior through three elements: high-level linguistic guidance, a set of "atomic" actions and expressions the robot can use, and a collection of examples. Implemented on a physical robot, the system adapts to multi-modal inputs and determines the appropriate way to assist humans with its arms while following the researchers' guidelines. At the same time, it coordinates the robot's eyelid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions, illustrating a shift from conventional, manually designed state-and-flow methods toward an intuitive, guidance-based, and example-driven approach to HRI.

The system comprises three modules: the "Scene Narrator", the "Planner", and the "Expresser". The Scene Narrator senses object poses, human postures, and dialogue, and builds a 3D representation of the scene. The Planner processes multi-modal inputs as event messages, including the positions of the people in the scene, and queries an LLM (GPT) to decide how the robot should act. The Expresser controls the actuators responsible for the robot's facial expressions and houses a library of pre-designed "atomic animation clips" for each actuator's movements. A rule-based mechanism additionally provides rapid expressions during the interval between the request and the response of each GPT query.
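The description above covers the Planner and Expresser only at a high level. The sketch below illustrates, in hypothetical Python, how such a loop might look: a prompt is assembled from the three regulation elements (guidance, atomic actions/expressions, examples) plus an incoming event message, and a rule-based expression bridges the latency of the GPT query. All names, message fields, and prompt wording here are assumptions for illustration, not the paper's actual implementation.

```python
import time
from dataclasses import dataclass, field

# --- Hypothetical data structures and content (not from the paper) ----------

@dataclass
class Event:
    """A multi-modal event message as the Planner might receive it."""
    source: str              # e.g. "speech", "gaze", "object", "action"
    description: str         # natural-language rendering of the event
    timestamp: float = field(default_factory=time.time)

# The three elements researchers use to regulate behavior (illustrative text).
GUIDANCE = (
    "You are a tabletop assistant robot. Be proactive but unobtrusive; "
    "hand objects to people only when they clearly need them."
)

ATOMIC_ACTIONS = {
    "hand_over(object, person)": "Pick up <object> and pass it to <person>.",
    "point_at(object)":          "Point at <object> with the arm.",
    "speak(text)":               "Say <text> aloud.",
    "express(clip)":             "Play a facial clip: nod, ear_wiggle, blink, ...",
}

EXAMPLES = [
    {"event": "Daniel looks at the bottle and says 'I am thirsty.'",
     "response": "hand_over(bottle, Daniel); speak('Here you go!'); express(nod)"},
]


def build_prompt(event: Event) -> str:
    """Assemble guidance, atomic capabilities, and examples into one LLM prompt."""
    capabilities = "\n".join(f"- {sig}: {desc}" for sig, desc in ATOMIC_ACTIONS.items())
    examples = "\n".join(f"Event: {e['event']}\nResponse: {e['response']}" for e in EXAMPLES)
    return (
        f"{GUIDANCE}\n\n"
        f"Available atomic actions and expressions:\n{capabilities}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Current event: {event.description}\n"
        f"Respond with a sequence of atomic actions."
    )


def handle_event(event: Event, query_llm, expresser) -> str:
    """Query the LLM for a plan; bridge the query latency with a rule-based expression."""
    # Rule-based interim expression: react immediately while the GPT query is pending.
    if event.source == "speech":
        expresser.play("look_at_speaker")   # e.g. turn neck and gaze toward the speaker
    else:
        expresser.play("blink")

    plan = query_llm(build_prompt(event))   # blocking LLM call; latency is masked above
    return plan                             # e.g. "point_at(cup); speak('This one?')"
```

Here `query_llm` and `expresser` are injected dependencies (an LLM client and an animation-clip player) and stand in for whatever interfaces the real system uses.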
The "Planner" module processes multi-modal inputs as event messages, encompassing the positions of individuals within the scene. The "Expresser" module controls the actuators responsible for the robot's facial expressions, housing a library of pre-designed "atomic animation clips" for each actuator's movements. The system also incorporates a rule-based mechanism to provide rapid expressions in the interim between the request and response of each GPT query. The system's evaluation setup includes a test scenario with two participants, "Daniel" and "Felix", seated around a table with various objects. The robot is designed to detect their head orientation, speech, and actions, even when objects obstruct its view. The participants interact with these objects and each other, following scripted scenarios to test the robot's multi-modal reasoning and expression capabilities. The system's preliminary test results demonstrate that the robot can effectively meet researcher expectations, suggesting that this approach holds the potential to transform human-robot interaction from a manual, state-and-flow design methodology to a more intuitive approach centered around guidance, capabilities, and example-driven frameworks.