The paper introduces ToMBench, a comprehensive benchmark for evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs). To address the limitations of existing ToM evaluations, ToMBench features three key characteristics: a systematic framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format for automated and unbiased evaluation, and a built-from-scratch bilingual inventory to avoid data leakage. The authors conducted extensive experiments using ToMBench to assess the ToM performance of 10 popular LLMs, finding that even advanced models like GPT-4 lag behind human performance by over 10 percentage points. The study highlights the need for more robust and general ToM capabilities in LLMs and aims to facilitate the development of LLMs with inherent social intelligence. The paper also discusses the limitations of the benchmark and future directions for research.
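To make the multiple-choice evaluation setup concrete, below is a minimal sketch of how automated accuracy scoring over such a benchmark could look. The data layout, field names, and helper function are illustrative assumptions, not ToMBench's actual schema or evaluation harness.

```python
# Minimal sketch of automated multiple-choice scoring for a ToM-style benchmark.
# The item/prediction structure here is a hypothetical example, not ToMBench's
# real data format or official evaluation code.

from collections import defaultdict

def score_predictions(items, predictions):
    """Compute overall and per-task accuracy for multiple-choice answers.

    items:       list of dicts with assumed keys "id", "task", "answer"
                 (gold option letter, e.g. "B").
    predictions: dict mapping item "id" to the model's chosen option letter.
    """
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for item in items:
        pred = predictions.get(item["id"], "").strip().upper()
        correct = pred == item["answer"].upper()
        per_task[item["task"]][0] += int(correct)
        per_task[item["task"]][1] += 1

    task_acc = {t: c / n for t, (c, n) in per_task.items() if n}
    total = sum(n for _, n in per_task.values())
    overall = sum(c for c, _ in per_task.values()) / max(1, total)
    return overall, task_acc

# Toy usage with made-up task names:
items = [
    {"id": 1, "task": "false belief", "answer": "B"},
    {"id": 2, "task": "faux pas", "answer": "C"},
]
predictions = {1: "B", 2: "A"}
print(score_predictions(items, predictions))
# (0.5, {'false belief': 1.0, 'faux pas': 0.0})
```

Scoring by option letter rather than by judging free-form text is what makes this style of evaluation automated and unbiased, as the summary notes; per-task accuracy also supports comparisons across the benchmark's 8 tasks.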