Can large language models explore in-context?

March 2024 | Akshay Krishnamurthy, Keegan Harris, Dylan J. Foster, Cyril Zhang, and Aleksandrs Slivkins
This paper investigates whether contemporary large language models (LLMs) can robustly explore in simple reinforcement learning (RL) environments, specifically multi-armed bandits (MABs). The study evaluates the native performance of GPT-3.5, GPT-4, and Llama2 without training interventions, deploying them as agents in MAB environments where the environment description and interaction history are specified entirely in-context. Only one configuration, GPT-4 with a specific prompt design involving chain-of-thought reasoning and an externally summarized interaction history, yields satisfactory exploratory behavior; all other configurations fail to converge to the best arm with significant probability, typically through suffix failures or uniform-like failures. The study highlights the importance of external summarization in enabling desirable behavior, and suggests that while the current generation of LLMs can explore in simple RL environments with appropriate prompt engineering, more sophisticated exploration capabilities may require training interventions such as fine-tuning or dataset curation. The findings indicate that non-trivial algorithmic interventions are needed to empower LLM-based decision-making agents in complex settings.
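To make the setup concrete, here is a minimal sketch (not the authors' experimental harness) of running a chat LLM as a bandit agent with an externally summarized history. The `query_llm` helper is a hypothetical placeholder for any chat-completion API, and the number of arms, horizon, and summary format (per-arm pull counts and mean rewards) are assumptions chosen only for illustration.

```python
import random

K = 5            # number of arms (assumed for illustration)
T = 100          # horizon
MEANS = [0.7 if a == 0 else 0.5 for a in range(K)]  # Bernoulli means; arm 0 is best

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion call to GPT-4 or similar.
    Here it returns a random arm index so the sketch runs end to end."""
    return str(random.randrange(K))

def summarize(counts, reward_sums):
    """Externally summarized history: per-arm pull counts and average rewards."""
    lines = []
    for a in range(K):
        mean = reward_sums[a] / counts[a] if counts[a] else float("nan")
        lines.append(f"arm {a}: pulled {counts[a]} times, average reward {mean:.2f}")
    return "\n".join(lines)

counts, reward_sums = [0] * K, [0.0] * K
for t in range(T):
    prompt = (
        f"You are choosing among {K} slot machines to maximize total reward over {T} rounds.\n"
        f"This is round {t + 1}. Summary of your interaction history so far:\n"
        f"{summarize(counts, reward_sums)}\n"
        "Think step by step, then answer with only the index of the arm to pull next."
    )
    arm = int(query_llm(prompt)) % K               # parse the model's chosen arm
    reward = float(random.random() < MEANS[arm])   # Bernoulli reward draw
    counts[arm] += 1
    reward_sums[arm] += reward
```

In the paper's terminology, this prompt corresponds to a summarized-history configuration with chain-of-thought reasoning; a raw-history variant would instead paste the full list of (arm, reward) observations into the context.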
The paper also discusses the challenges of assessing LLM capabilities, including the need for extensive experimentation and the use of surrogate statistics to detect long-term exploration failures. The results indicate that LLMs often exhibit bimodal behavior, with most configurations failing to explore effectively. The one successful configuration combines GPT-4 with a prompt design that includes a suggestive framing, a summarized history, and chain-of-thought reasoning, and it performs comparably to baseline algorithms such as Thompson Sampling in terms of reward. The study concludes that while LLMs can exhibit exploratory behavior under specific conditions, their performance in more complex environments is likely to be limited without additional interventions, underscoring the need for further research into algorithmic interventions that strengthen LLM-based decision making in such settings.
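The surrogate statistics mentioned above can also be sketched briefly. The definitions below are one plausible formalization rather than the paper's exact thresholds: a suffix failure is flagged when the best arm is never pulled after a cutoff round, and a uniform-like failure when all arms end up pulled in nearly equal proportions; the cutoff fraction and tolerance are assumed values.

```python
from collections import Counter

def suffix_failure(arm_history, best_arm, cutoff_fraction=0.5):
    """True if the best arm is never pulled after the cutoff round
    (a surrogate for failing to converge to the best arm)."""
    cutoff = int(len(arm_history) * cutoff_fraction)
    return best_arm not in arm_history[cutoff:]

def uniform_like_failure(arm_history, num_arms, tolerance=0.05):
    """True if every arm's empirical pull frequency stays within `tolerance`
    of the uniform rate 1/num_arms (exploring forever, never exploiting)."""
    counts = Counter(arm_history)
    total = len(arm_history)
    return all(abs(counts.get(a, 0) / total - 1 / num_arms) <= tolerance
               for a in range(num_arms))

# Example usage on two synthetic runs with 4 arms, where arm 0 is the best arm:
uniform_run = [t % 4 for t in range(100)]   # cycles through all arms forever
stuck_run = [0] * 10 + [3] * 90             # abandons the best arm early on
print(uniform_like_failure(uniform_run, num_arms=4))  # True
print(suffix_failure(stuck_run, best_arm=0))          # True
```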