GAMEBENCH: Evaluating Strategic Reasoning Abilities of LLM Agents

22 Jul 2024 | Anthony Costarelli*, Mat Allen*, Roman Hauksson*, Grace Sodunke*, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav
GAMEBENCH is a benchmark designed to evaluate the strategic reasoning abilities of large language models (LLMs) across a variety of game environments. The benchmark comprises nine games, each covering different aspects of strategic reasoning. The study evaluates GPT-3 and GPT-4, along with two scaffolding techniques: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). None of the tested configurations matches human performance, and at worst GPT-4 performs worse than random action selection; CoT and RAP both improve scores, but not to human levels. The benchmark code is available at https://github.com/Joshuaclymer/GameBench.

GAMEBENCH is framed as a multi-player, cross-domain framework for evaluating strategic reasoning in LLM agents through games. It covers both discrete and open-ended action spaces across several reasoning domains, including abstract strategy, non-deterministic outcomes, hidden information, language communication, social deduction, and cooperation between players. The games were selected because they lack published strategy guides, which helps ensure that game-specific strategies are sufficiently out-of-distribution with respect to pretraining data.

The benchmark evaluates GPT-3, GPT-4, and their CoT- and RAP-scaffolded variants by playing them against each other, against a random-action-selector baseline, and against a human baseline. CoT-augmented and RAP-augmented models outperform the random baseline, while GPT-3 merely matches it and GPT-4 performs worse. The human baseline outperforms all models. The study also discusses limitations and future directions, including the need for more comprehensive human data and the importance of preserving the games' out-of-distribution status. Overall, the results show that scaffolding techniques improve strategic reasoning, but even the best configurations fall short of human reasoning. The study concludes that while LLMs show great promise on in-distribution tasks, their performance on out-of-distribution tasks highlights the potential risks of deploying autonomous agents, and the results suggest that LLMs are becoming increasingly receptive to scaffolding techniques.
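To make the evaluation setup concrete, the sketch below shows one way a GameBench-style harness might drive agents through a turn-based game and score them, including the random-action-selector baseline. This is a minimal illustration under assumed names (Observation, Agent, RandomAgent, MatchingPennies); it does not reproduce the actual GameBench API.

```python
# Hypothetical sketch of a game/agent interface for a GameBench-style benchmark.
# Class and method names are illustrative; they do not mirror the real GameBench code.
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Observation:
    """What a single player is allowed to see on their turn."""
    text: str                      # natural-language description of the visible state
    available_actions: list[str]   # discrete action labels offered this turn


class Agent(ABC):
    @abstractmethod
    def take_action(self, observation: Observation) -> str:
        """Return one of observation.available_actions."""


class RandomAgent(Agent):
    """Random-action-selector baseline, as used for comparison in the paper."""
    def take_action(self, observation: Observation) -> str:
        return random.choice(observation.available_actions)


class MatchingPennies:
    """A toy two-player game standing in for a benchmark environment."""
    def __init__(self):
        self.choices: dict[int, str] = {}

    def observe(self, player: int) -> Observation:
        return Observation(
            text=f"You are player {player}. Pick heads or tails.",
            available_actions=["heads", "tails"],
        )

    def play(self, agents: list[Agent]) -> list[float]:
        for player, agent in enumerate(agents):
            self.choices[player] = agent.take_action(self.observe(player))
        # Player 0 wins on a match, player 1 wins on a mismatch.
        if self.choices[0] == self.choices[1]:
            return [1.0, 0.0]
        return [0.0, 1.0]


if __name__ == "__main__":
    scores = MatchingPennies().play([RandomAgent(), RandomAgent()])
    print("final scores:", scores)
```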
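The scaffolding techniques can be sketched in the same spirit. The snippet below illustrates how a Chain-of-Thought wrapper might prompt a model to reason step by step before committing to a legal action; query_model is a placeholder for any LLM API call, and the prompt and parsing format are assumptions rather than the prompts used in the paper.

```python
# Hypothetical sketch of a Chain-of-Thought scaffold around a text-completion model.
# query_model stands in for an LLM call; the prompt format is illustrative only.
def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; returns reasoning followed by 'ACTION: <choice>'."""
    return "The opponent cannot see my pick, so either option is symmetric.\nACTION: heads"


def chain_of_thought_action(observation_text: str, available_actions: list[str]) -> str:
    prompt = (
        f"{observation_text}\n"
        f"Available actions: {', '.join(available_actions)}\n"
        "Think step by step about which action is strategically best, "
        "then end your answer with a line of the form 'ACTION: <action>'."
    )
    reply = query_model(prompt)
    # Parse the final ACTION line; fall back to the first legal action if parsing fails.
    for line in reversed(reply.splitlines()):
        if line.strip().upper().startswith("ACTION:"):
            choice = line.split(":", 1)[1].strip().lower()
            if choice in available_actions:
                return choice
    return available_actions[0]


print(chain_of_thought_action("Pick heads or tails.", ["heads", "tails"]))
```

Parsing a trailing ACTION line and falling back to a legal default keeps a scaffolded agent from submitting malformed moves, which matters when its scores are compared against the random baseline.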