May 6–10, 2024, Auckland, New Zealand | Tan Zhi-Xuan*, Lance Ying*, Vikash Mansinghka, Joshua B. Tenenbaum
This paper introduces Cooperative Language-Guided Inverse Plan Search (CLIPS), a Bayesian architecture for pragmatic instruction following and goal assistance. CLIPS models humans as cooperative planners who communicate joint plans to the assistant, which then performs multimodal Bayesian inference over the human's goal from actions and language. Using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan, CLIPS acts to minimize expected goal achievement cost, enabling it to follow ambiguous instructions effectively. The paper evaluates CLIPS in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), finding that it significantly outperforms GPT-4V, LLM-based literal instruction following, and unimodal inverse planning in both accuracy and helpfulness, while closely matching human raters' judgments. CLIPS leverages joint intentionality, Bayesian Theory of Mind, and multimodal LLMs to achieve these results, demonstrating its ability to integrate information from observed actions and ambiguous instructions, and to provide context-sensitive assistance.
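The sketch below is an illustrative (not the authors') rendering of the inference-and-decision loop the abstract describes: a posterior over goals is formed by weighting each candidate goal by how well the observed actions fit a hypothesized joint plan and by an LLM's likelihood for the instruction given that plan, and the assistant then acts to minimize expected goal-achievement cost. All function and parameter names here (`plan_for_goal`, `action_likelihood`, `llm_utterance_likelihood`, `expected_cost`) are hypothetical placeholders, not the paper's actual API.

```python
def infer_goal_posterior(goals, prior, observed_actions, utterance,
                         plan_for_goal, action_likelihood,
                         llm_utterance_likelihood):
    """Return P(goal | actions, utterance) over a finite set of candidate goals.

    Combines two likelihood terms per goal (as described in the abstract):
      * action_likelihood: how well the observed human actions fit a
        hypothesized joint plan toward that goal (inverse plan search).
      * llm_utterance_likelihood: an LLM-based score for the instruction
        given the hypothesized joint plan (pragmatic utterance model).
    """
    weights = {}
    for goal in goals:
        plan = plan_for_goal(goal)                       # hypothesized joint plan
        w = prior[goal]
        w *= action_likelihood(observed_actions, plan)   # P(actions | plan)
        w *= llm_utterance_likelihood(utterance, plan)   # P(utterance | plan)
        weights[goal] = w
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def choose_assistant_action(actions, posterior, expected_cost):
    """Pick the assistant action minimizing expected goal-achievement cost
    under the goal posterior (the decision rule sketched in the abstract)."""
    return min(
        actions,
        key=lambda a: sum(p * expected_cost(a, goal)
                          for goal, p in posterior.items()),
    )
```

Because the posterior weighs both modalities, an ambiguous instruction ("grab the key for me") is disambiguated by the human's observed trajectory, and uninformative actions are disambiguated by the instruction, which is the context-sensitive behavior the evaluation measures.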