GTBENCH: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

10 Jun 2024 | Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu
GTBENCH is a game-theoretic evaluation environment designed to assess the strategic reasoning abilities of large language models (LLMs). It comprises 10 widely recognized game-theoretic tasks spanning a comprehensive taxonomy: complete- versus incomplete-information, dynamic versus static, and probabilistic versus deterministic games. These competitive settings, such as board and card games, demand pure logic and strategic reasoning without the complexity of narrative contexts, which makes them well suited for evaluating LLMs.

The evaluations show that LLMs perform poorly in complete-information, deterministic games but remain competitive in probabilistic scenarios. Open-source models such as CodeLlama-34b-Instruct and Llama-2-70b-chat are less competitive than commercial models like GPT-4, although the recently released Llama-3-70b-Instruct narrows the gap; in head-to-head LLM-vs-LLM matches, GPT-4 and Llama-3-70b-Instruct perform best. Code pretraining benefits strategic reasoning, whereas advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help.

The study also characterizes game-theoretic properties of LLM play, such as equilibrium and Pareto efficiency in repeated games, and provides detailed error profiles that reveal common failure patterns: misinterpretation, factual inaccuracies, overconfidence, calculation mistakes, and endgame misdetection. With standardized protocols, GTBENCH offers a comprehensive evaluation of LLMs' strategic reasoning abilities in competitive environments and serves as a foundation for further exploration of these capabilities.
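To make the evaluation setup concrete, the following is a minimal, illustrative sketch (not the GTBench codebase or its API) of a head-to-head match loop in the spirit of the benchmark, using Tic-Tac-Toe as an example of a complete-information, deterministic game. The `random_agent` function is a hypothetical stand-in for an LLM-backed agent, which in the actual benchmark would receive the board state and legal actions in a prompt and have its reply parsed into a move.

```python
"""Illustrative sketch only: a head-to-head match loop in the spirit of
GTBench's LLM-vs-LLM evaluation, using Tic-Tac-Toe."""
import random
from typing import Callable, List, Optional

Board = List[Optional[str]]  # 9 cells, each "X", "O", or None

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: Board) -> Optional[str]:
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board: Board) -> List[int]:
    return [i for i, cell in enumerate(board) if cell is None]

# Hypothetical stand-in for an LLM-backed agent: here we simply pick a
# random legal move instead of prompting a model and parsing its answer.
def random_agent(board: Board, mark: str) -> int:
    return random.choice(legal_moves(board))

def play_match(agent_x: Callable[[Board, str], int],
               agent_o: Callable[[Board, str], int]) -> str:
    """Plays one game and returns 'X', 'O', or 'draw'."""
    board: Board = [None] * 9
    agents = {"X": agent_x, "O": agent_o}
    mark = "X"
    while legal_moves(board):
        move = agents[mark](board, mark)
        if move not in legal_moves(board):  # an illegal move forfeits the game
            return "O" if mark == "X" else "X"
        board[move] = mark
        if winner(board):
            return mark
        mark = "O" if mark == "X" else "X"
    return "draw"

if __name__ == "__main__":
    results = [play_match(random_agent, random_agent) for _ in range(100)]
    wins_x, wins_o = results.count("X"), results.count("O")
    print(f"X wins {wins_x}, O wins {wins_o}, draws {results.count('draw')}")
    # A simple (wins - losses) / games score, in the spirit of (but not
    # identical to) the relative-advantage metric reported in the paper.
    print(f"relative advantage of X: {(wins_x - wins_o) / len(results):+.2f}")
```

In the benchmark itself, the two agents would be different LLMs (or an LLM against a conventional solver), and many such matches are aggregated into a relative-advantage score per game and model pairing.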