GTBENCH: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

10 Jun 2024 | Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu
GTBENCH is a game-theoretic evaluation environment designed to assess the strategic reasoning abilities of large language models (LLMs). It comprises 10 widely recognized game-theoretic tasks spanning a comprehensive taxonomy: complete- versus incomplete-information, dynamic versus static, and probabilistic versus deterministic games. These competitive settings, such as board and card games, demand pure logic and strategic reasoning without the complexity of narrative contexts, which makes them well suited for evaluating LLMs.

The evaluations show that LLMs perform poorly in complete-information, deterministic games but remain competitive in probabilistic scenarios. Open-source models such as CodeLlama-34b-Instruct and Llama-2-70b-chat are less competitive than commercial models like GPT-4, although the recently released Llama-3-70b-Instruct narrows the gap; in head-to-head LLM-vs-LLM matches, GPT-4 and Llama-3-70b-Instruct perform best. Code pretraining benefits strategic reasoning, whereas advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help.

The study also characterizes game-theoretic properties of LLM play, such as equilibrium and Pareto efficiency in repeated games, and provides detailed error profiles that reveal common failure patterns: misinterpretation, factual inaccuracies, overconfidence, calculation mistakes, and endgame misdetection. With standardized protocols, GTBENCH offers a comprehensive evaluation of LLMs' strategic reasoning abilities in competitive environments and serves as a foundation for further exploration of these capabilities.
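To make the evaluation setup concrete, the following is a minimal, illustrative sketch (not the GTBench codebase or its API) of a head-to-head match loop in the spirit of the benchmark, using Tic-Tac-Toe as an example of a complete-information, deterministic game. The `random_agent` function is a hypothetical stand-in for an LLM-backed agent, which in the actual benchmark would receive the board state and legal actions in a prompt and have its reply parsed into a move.

```python
"""Illustrative sketch only: a head-to-head match loop in the spirit of
GTBench's LLM-vs-LLM evaluation, using Tic-Tac-Toe."""
import random
from typing import Callable, List, Optional

Board = List[Optional[str]]  # 9 cells, each "X", "O", or None

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: Board) -> Optional[str]:
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board: Board) -> List[int]:
    return [i for i, cell in enumerate(board) if cell is None]

# Hypothetical stand-in for an LLM-backed agent: here we simply pick a
# random legal move instead of prompting a model and parsing its answer.
def random_agent(board: Board, mark: str) -> int:
    return random.choice(legal_moves(board))

def play_match(agent_x: Callable[[Board, str], int],
               agent_o: Callable[[Board, str], int]) -> str:
    """Plays one game and returns 'X', 'O', or 'draw'."""
    board: Board = [None] * 9
    agents = {"X": agent_x, "O": agent_o}
    mark = "X"
    while legal_moves(board):
        move = agents[mark](board, mark)
        if move not in legal_moves(board):  # an illegal move forfeits the game
            return "O" if mark == "X" else "X"
        board[move] = mark
        if winner(board):
            return mark
        mark = "O" if mark == "X" else "X"
    return "draw"

if __name__ == "__main__":
    results = [play_match(random_agent, random_agent) for _ in range(100)]
    wins_x, wins_o = results.count("X"), results.count("O")
    print(f"X wins {wins_x}, O wins {wins_o}, draws {results.count('draw')}")
    # A simple (wins - losses) / games score, in the spirit of (but not
    # identical to) the relative-advantage metric reported in the paper.
    print(f"relative advantage of X: {(wins_x - wins_o) / len(results):+.2f}")
```

In the benchmark itself, the two agents would be different LLMs (or an LLM against a conventional solver), and many such matches are aggregated into a relative-advantage score per game and model pairing.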