TMBENCH is a comprehensive benchmark for evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs). It comprises 8 social cognition tasks and 31 ToM abilities spanning diverse real-world social scenarios, with 2,860 testing samples in both Chinese and English. The benchmark uses a multiple-choice question format to enable automated and unbiased evaluation, together with a built-from-scratch bilingual inventory to prevent data leakage. TMBENCH was used to evaluate the ToM performance of 10 popular LLMs, including GPT-4 and LLaMA. Results show that even the most advanced LLMs trail human performance by more than 10 percentage points on most tasks and abilities, indicating that LLMs have not yet achieved human-level ToM. A detailed analysis of performance across tasks and abilities highlights the models' limitations in fully understanding social scenarios and suggests that they still rely on semantic associations rather than human-like cognitive processes when answering ToM questions. By providing a systematic, automated, and original evaluation framework, TMBENCH aims to support future research and facilitate the development of LLMs with inherent social intelligence. The work also addresses limitations such as data contamination, inventory size, and language coverage, and suggests future directions for improving LLMs' ToM capabilities.
The study emphasizes the importance of a holistic benchmark for evaluating LLMs' ToM capabilities, as well as the need for further research into how LLMs can better understand and interpret human mental states and cognitive processes.
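Because every item is multiple-choice, automated evaluation reduces to comparing a predicted option letter against the gold answer and aggregating accuracy per task. The sketch below illustrates this kind of scoring loop; the item schema, the file name tom_items.json, and the query_model callable are assumptions made for illustration, not the actual TMBENCH data format or harness.

```python
import json
from collections import defaultdict

# Hypothetical item layout (not the actual TMBENCH schema):
# {"task": "False Belief", "story": "...", "question": "...",
#  "options": ["A. ...", "B. ...", "C. ...", "D. ..."], "answer": "B"}

def build_prompt(item):
    """Format a single multiple-choice ToM item as a model prompt."""
    options = "\n".join(item["options"])
    return (
        f"{item['story']}\n\n"
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Answer with a single letter (A, B, C, or D)."
    )

def extract_choice(response):
    """Pull the first option letter out of the model's reply."""
    for ch in response.strip().upper():
        if ch in "ABCD":
            return ch
    return None

def evaluate(items, query_model):
    """Score accuracy overall and per task.

    `query_model` is a placeholder callable: prompt str -> response str.
    """
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for item in items:
        prediction = extract_choice(query_model(build_prompt(item)))
        hit = int(prediction == item["answer"])
        per_task[item["task"]][0] += hit
        per_task[item["task"]][1] += 1
    report = {task: c / n for task, (c, n) in per_task.items()}
    total = sum(n for _, n in per_task.values())
    report["overall"] = sum(c for c, _ in per_task.values()) / total
    return report

if __name__ == "__main__":
    with open("tom_items.json", encoding="utf-8") as f:
        items = json.load(f)
    # Stub model for demonstration: always answers "A".
    print(evaluate(items, lambda prompt: "A"))
```

A per-task breakdown like this is what makes the ability-level comparison against human performance possible, since human accuracy can be reported on the same task partition.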