May 6–10, 2024 | Tan Zhi-Xuan, Lance Ying, Vikash Mansinghka, Joshua B. Tenenbaum
This paper introduces CLIPS, a Bayesian architecture for pragmatic instruction following and goal assistance. CLIPS models humans as cooperative planners who communicate joint plans as instructions, implemented as a probabilistic program over a planner that computes a joint policy for each hypothesized goal. It performs multimodal Bayesian inference over the human's goal from both actions and language, using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan. Given this posterior, the assistant acts to minimize expected goal-achievement cost, enabling it to pragmatically follow ambiguous instructions and provide effective assistance even when uncertain about the goal. Evaluated in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), CLIPS significantly outperforms GPT-4V, LLM-based literal instruction following, and unimodal inverse planning in both accuracy and helpfulness, while closely matching the inferences and assistive judgments of human raters. By combining observed actions with inferred goals, CLIPS resolves ambiguous language, interprets joint instructions, and corrects for incomplete commands, achieving much higher goal accuracy and cooperative efficiency than the alternatives. The results demonstrate the importance of accounting for pragmatic context in assistive agents.
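To make the architecture concrete, here is a minimal Python sketch of CLIPS-style inference and action selection. It is an illustration under stated assumptions, not the authors' implementation: the likelihood and cost functions are passed in as hypothetical callables standing in for the paper's inverse-planning model, LLM instruction scorer, and planner.

```python
# Minimal sketch of CLIPS-style multimodal goal inference and assistance.
# All callables below are hypothetical stand-ins, not the paper's actual API:
#   action_lik(actions, goal)  ~ P(observed actions | joint plan for goal)
#   utt_lik(utterance, goal)   ~ P(instruction | joint plan for goal), LLM-scored
#   cost(assist_action, goal)  ~ expected cost to achieve goal after assist_action

def goal_posterior(goals, prior, actions, utterance, action_lik, utt_lik):
    """Multimodal Bayesian posterior over goals given actions and an instruction."""
    joint = {
        g: prior[g] * action_lik(actions, g) * utt_lik(utterance, g)
        for g in goals
    }
    z = sum(joint.values())
    return {g: p / z for g, p in joint.items()}

def choose_assist_action(assist_actions, goals, posterior, cost):
    """Select the action minimizing expected goal-achievement cost under the posterior."""
    return min(
        assist_actions,
        key=lambda a: sum(posterior[g] * cost(a, g) for g in goals),
    )

# Toy usage: two candidate goals, a uniform prior, and dummy likelihoods.
goals = ["red_gem", "blue_gem"]
prior = {g: 0.5 for g in goals}
post = goal_posterior(
    goals, prior,
    actions=["move_left"], utterance="grab the gem on the left",
    action_lik=lambda a, g: 0.9 if g == "red_gem" else 0.1,
    utt_lik=lambda u, g: 0.8 if g == "red_gem" else 0.2,
)
```

Note the key design choice this sketch highlights: the same posterior serves both interpretation and assistance, so an ambiguous or incomplete instruction never has to be resolved to a single literal reading before the assistant can act helpfully.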