Can Language Models Serve as Text-Based World Simulators?

10 Jun 2024 | Ruoyao Wang†, Graham Todd†, Ziang Xiao♠, Xingdi Yuan◇, Marc-Alexandre Côté♡, Peter Clark♣, Peter Jansen♠
The paper explores whether current large language models (LLMs) can serve as text-based world simulators, capable of correctly predicting how actions change world states. The authors introduce a new benchmark, BYTESIZED32-State-Prediction, containing a dataset of text-game state transitions and accompanying tasks, and test GPT-4 on it to quantify its ability to simulate virtual environments. Despite GPT-4's otherwise impressive performance, the study finds that it remains an unreliable simulator without further innovation: it struggles with non-trivial state transitions that require arithmetic, common-sense, or scientific reasoning, with accuracy dropping to 59.9% on transitions involving such complex changes. The results suggest that while LLMs are useful for many downstream tasks, they are not yet ready to act as reliable world simulators. The paper contributes insights into the capabilities and limitations of current LLMs, provides a novel benchmark to track future progress, and discusses limitations and ethical concerns, emphasizing the need for care when applying LLMs in settings where they interact directly with humans.
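
To make the state-prediction task concrete, here is a minimal Python sketch of how such an evaluation could be run: each game state is represented as a JSON object, the model is prompted to output the full state that results from an action, and predictions are scored against the gold next state. The schema, field names, and the call_model hook below are illustrative assumptions, not the benchmark's actual code or API.

```python
import json

# Hypothetical example of a single state-transition item: a game state
# before an action, the action taken, and the resulting gold state.
# Field names here are illustrative, not the benchmark's actual schema.
example = {
    "state_before": {
        "stove": {"isOn": False, "temperature": 20},
        "pot": {"contains": ["water"], "temperature": 20},
    },
    "action": "turn on the stove",
    "state_after": {
        "stove": {"isOn": True, "temperature": 20},
        "pot": {"contains": ["water"], "temperature": 20},
    },
}

def build_prompt(state_before: dict, action: str) -> str:
    """Ask the model to simulate one transition and reply with JSON."""
    return (
        "You are a text game world simulator.\n"
        f"Current state:\n{json.dumps(state_before, indent=2)}\n"
        f"Action: {action}\n"
        "Output the full resulting state as JSON."
    )

def evaluate(items: list[dict], call_model) -> float:
    """Whole-state exact-match accuracy over a list of transition items.

    `call_model` is a placeholder for whatever LLM client is used
    (e.g., a wrapper around GPT-4) that maps a prompt string to a reply.
    """
    correct = 0
    for item in items:
        raw = call_model(build_prompt(item["state_before"], item["action"]))
        try:
            predicted = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output counts as incorrect
        correct += predicted == item["state_after"]
    return correct / len(items)
```

Whole-state exact match is a deliberately strict headline metric; an actual harness would likely also score individual object properties to distinguish near-misses (one wrong field) from wholesale simulation failures.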