The paper introduces ToMBench, a comprehensive benchmark for evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs). To address the limitations of existing ToM evaluations, ToMBench features three key characteristics: a systematic framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format for automated and unbiased evaluation, and a built-from-scratch bilingual inventory to avoid data leakage. The authors conducted extensive experiments using ToMBench to assess the ToM performance of 10 popular LLMs, finding that even advanced models like GPT-4 lag behind human performance by over 10 percentage points. The study highlights the need for more robust and general ToM capabilities in LLMs and aims to facilitate the development of LLMs with inherent social intelligence. The paper also discusses the limitations of the benchmark and future directions for research.
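To make the multiple-choice evaluation setup concrete, below is a minimal sketch of how automated accuracy scoring over such a benchmark could look. The data layout, field names, and helper function are illustrative assumptions, not ToMBench's actual schema or evaluation harness.

```python
# Minimal sketch of automated multiple-choice scoring for a ToM-style benchmark.
# The item/prediction structure here is a hypothetical example, not ToMBench's
# real data format or official evaluation code.

from collections import defaultdict

def score_predictions(items, predictions):
    """Compute overall and per-task accuracy for multiple-choice answers.

    items:       list of dicts with assumed keys "id", "task", "answer"
                 (gold option letter, e.g. "B").
    predictions: dict mapping item "id" to the model's chosen option letter.
    """
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for item in items:
        pred = predictions.get(item["id"], "").strip().upper()
        correct = pred == item["answer"].upper()
        per_task[item["task"]][0] += int(correct)
        per_task[item["task"]][1] += 1

    task_acc = {t: c / n for t, (c, n) in per_task.items() if n}
    total = sum(n for _, n in per_task.values())
    overall = sum(c for c, _ in per_task.values()) / max(1, total)
    return overall, task_acc

# Toy usage with made-up task names:
items = [
    {"id": 1, "task": "false belief", "answer": "B"},
    {"id": 2, "task": "faux pas", "answer": "C"},
]
predictions = {1: "B", 2: "A"}
print(score_predictions(items, predictions))
# (0.5, {'false belief': 1.0, 'faux pas': 0.0})
```

Scoring by option letter rather than by judging free-form text is what makes this style of evaluation automated and unbiased, as the summary notes; per-task accuracy also supports comparisons across the benchmark's 8 tasks.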