GAMEBENCH: Evaluating Strategic Reasoning Abilities of LLM Agents


22 Jul 2024 | Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav
**Abstract:** Large language models (LLMs) have demonstrated impressive few-shot performance on natural language understanding tasks. However, there is a lack of comprehensive frameworks for evaluating their strategic reasoning abilities in complex scenarios. To address this gap, we introduce GAMEBENCH, a cross-domain benchmark designed to evaluate LLM agents' strategic reasoning across various game environments. We focus on 9 different game environments, each covering at least one axis of key reasoning skills identified in strategy games. Our evaluations use GPT-3 and GPT-4, along with two scaffolding frameworks: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). The results show that none of the tested models match human performance and that, at worst, GPT-4 performs worse than random action. CoT and RAP improve scores but do not reach human levels.

**Introduction:** The capabilities of LLMs have rapidly advanced, enabling their use in various agentic tasks. While existing benchmarks evaluate LLMs on practical, in-distribution knowledge, a strategic reasoning benchmark is needed to evaluate out-of-distribution reasoning abilities. GAMEBENCH aims to fill this gap by providing a diverse suite of multi-agent games that span a range of strategic reasoning domains, including abstract strategy, non-deterministic outcomes, hidden information, language communication, social deduction, and cooperation.

**Game Selection:** We curated a diverse set of games, including board games, card games, and social deception games, to evaluate LLMs' strategic reasoning abilities. The games were selected so that game-specific strategies are unlikely to be represented in LLM pretraining data, providing a genuinely out-of-distribution test.

**API and Rating Calculation:** Each game environment is implemented in Python, allowing agents to query the game state and choose among the available actions (minimal sketches of such an interface and of the rating fit are given below). Agents are rated using the exponential Bradley-Terry model, which assumes that each agent's ability is fixed and does not change over time, making it suitable for evaluating LLMs.

**Empirical Results:** CoT-augmented models outperform the base models, while RAP-augmented models show some improvement but fall short of human levels. GPT-3 performs poorly, and GPT-4, despite its superior performance on other tasks, performs worse than random action. Human participants consistently outperform all LLM agents, highlighting the need for further improvements in strategic reasoning.

**Discussion:** We discuss the limitations of our work, including the need to confirm the out-of-distribution status of the games, to protect out-of-distribution games from becoming in-distribution, and to improve the robustness of our aggregation methods. We also propose future directions, such as adding more games and agents to the benchmark and collecting more comprehensive human data.

**Conclusion:** GAMEBENCH is the first benchmark to evaluate LLM agents' strategic reasoning abilities across a diverse suite of multi-agent game environments.
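The summary above notes that each game environment is implemented in Python and exposes the game state and available actions to agents. The paper's actual API is not reproduced here; the following is a minimal sketch of what a turn-based, per-player-observation interface could look like. All names (`GameEnv`, `Agent`, `Observation`, `play`) are hypothetical illustrations, not the GAMEBENCH API itself.

```python
# Minimal sketch of a turn-based game/agent interface in the spirit of the
# Python environments described above. All class and method names here are
# hypothetical illustrations, not the actual GAMEBENCH API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Observation:
    """What a single agent is allowed to see (supports hidden information)."""
    text: str                 # natural-language description of the visible state
    legal_actions: list[str]  # actions the agent may take this turn


class GameEnv(ABC):
    @abstractmethod
    def observe(self, player: int) -> Observation:
        """Return the player-specific view of the current state."""

    @abstractmethod
    def step(self, player: int, action: str) -> None:
        """Apply the chosen action and advance the game."""

    @abstractmethod
    def is_over(self) -> bool: ...

    @abstractmethod
    def scores(self) -> dict[int, float]:
        """Final score for each player."""


class Agent(ABC):
    @abstractmethod
    def act(self, obs: Observation) -> str:
        """Choose one of obs.legal_actions."""


def play(env: GameEnv, agents: dict[int, Agent], turn_order: list[int]) -> dict[int, float]:
    """Run one game to completion and return the per-player scores."""
    while not env.is_over():
        for player in turn_order:
            if env.is_over():
                break
            obs = env.observe(player)
            env.step(player, agents[player].act(obs))
    return env.scores()
```

Separating `observe` per player is what lets the same loop cover both perfect-information board games and hidden-information or social-deduction games.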
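The evaluations use Chain-of-Thought (CoT) prompting as one of the scaffolding frameworks. The sketch below illustrates the general idea of a CoT scaffold around a generic LLM call; the `complete` placeholder, the prompt wording, and the fallback behavior are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a Chain-of-Thought (CoT) scaffold around a generic LLM call.
# `complete` is a stand-in for any chat-completion API; the prompt wording and
# fallback logic are illustrative assumptions, not the paper's actual prompts.
def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g. a chat API request)."""
    raise NotImplementedError


def choose_action_cot(state_text: str, legal_actions: list[str]) -> str:
    """Ask the model to reason step by step, then return one legal action."""
    prompt = (
        f"Game state:\n{state_text}\n\n"
        f"Legal actions: {', '.join(legal_actions)}\n\n"
        "Think step by step about the best strategy, then give exactly one "
        "legal action on the final line of your answer."
    )
    reply = complete(prompt)
    lines = reply.strip().splitlines()
    candidate = lines[-1].strip() if lines else ""
    # If the model's final line is not a legal action, fall back to the first one.
    return candidate if candidate in legal_actions else legal_actions[0]
```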
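Agents are rated with the exponential Bradley-Terry model, under which the probability that agent i beats agent j is exp(r_i) / (exp(r_i) + exp(r_j)) for fixed ratings r. The sketch below fits such ratings to pairwise game outcomes by gradient ascent on the log-likelihood; the fitting procedure and the toy data are illustrative assumptions, since the paper's exact aggregation pipeline is not described here.

```python
# Sketch of rating agents with the (exponential) Bradley-Terry model:
#   P(i beats j) = exp(r_i) / (exp(r_i) + exp(r_j)).
# Ratings are fit by gradient ascent on the log-likelihood of observed
# pairwise results. The toy outcomes below are made up for illustration only.
import math


def fit_bradley_terry(results: list[tuple[str, str]],
                      steps: int = 2000, lr: float = 0.05) -> dict[str, float]:
    """results: list of (winner, loser) pairs. Returns a rating per agent."""
    agents = {a for pair in results for a in pair}
    r = {a: 0.0 for a in agents}
    for _ in range(steps):
        grad = {a: 0.0 for a in agents}
        for winner, loser in results:
            # P(winner beats loser) under the current ratings
            p = math.exp(r[winner]) / (math.exp(r[winner]) + math.exp(r[loser]))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for a in agents:
            r[a] += lr * grad[a]
    # Ratings are only identified up to an additive shift, so center them at zero.
    mean = sum(r.values()) / len(r)
    return {a: v - mean for a, v in r.items()}


# Toy example (fabricated outcomes, for illustration only):
games = [("human", "gpt4-cot"), ("gpt4-cot", "gpt3"),
         ("human", "gpt3"), ("gpt3", "random")]
print(fit_bradley_terry(games))
```

In practice one would add a small prior or regularization term so the fit stays bounded when an agent never loses (or never wins) in the observed games.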