Can large language models explore in-context?

March 2024 | Akshay Krishnamurthy, Keegan Harris, Dylan J. Foster, Cyril Zhang, and Aleksandrs Slivkins
The paper investigates whether contemporary Large Language Models (LLMs) can engage in exploration, a key capability in reinforcement learning and decision-making. The authors focus on the native performance of existing LLMs without training interventions, deploying them as agents in simple multi-armed bandit (MAB) environments where the environment description and interaction history are specified entirely within the LLM prompt. They experiment with GPT-3.5, GPT-4, and LLaMA2, using various prompt designs.

The findings indicate that only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history. All other configurations failed to explore robustly, including those with chain-of-thought reasoning but without summarized history. The authors conclude that external summarization, which may not be feasible in more complex settings, is crucial for obtaining desirable behavior from LLM agents. They suggest that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be necessary to empower LLM-based decision-making agents in complex environments. The paper also discusses the challenges and limitations of assessing LLM capabilities and the need for methodological advancements to further evaluate and mitigate exploration failures.
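To make the setup concrete, the sketch below illustrates the kind of in-context bandit evaluation loop the paper describes. This is not the authors' code: the arm count, horizon, and reward means are illustrative, and llm_choose_arm is a hypothetical stand-in for the actual LLM call (e.g., GPT-4 with a chain-of-thought prompt), stubbed with a uniform-random choice so the snippet runs offline.

```python
import random

# Minimal sketch of an in-context bandit loop: a 5-armed Bernoulli bandit
# where the agent's prompt carries an externally summarized history
# (per-arm pull counts and mean rewards) rather than the raw transcript.
# Parameters below are illustrative, not the paper's exact configuration.

N_ARMS = 5
HORIZON = 100
TRUE_MEANS = [0.5, 0.5, 0.5, 0.5, 0.75]  # one best arm


def summarize_history(counts, reward_sums):
    """Build the 'summarized history' condition: per-arm pull counts and
    empirical mean rewards, instead of the round-by-round transcript."""
    lines = []
    for arm in range(N_ARMS):
        mean = reward_sums[arm] / counts[arm] if counts[arm] else 0.0
        lines.append(f"arm {arm}: pulled {counts[arm]} times, mean reward {mean:.2f}")
    return "\n".join(lines)


def llm_choose_arm(prompt):
    """Hypothetical placeholder for the LLM call. Stubbed with a random
    choice so the sketch is runnable without an API key."""
    return random.randrange(N_ARMS)


counts = [0] * N_ARMS
reward_sums = [0.0] * N_ARMS
total_reward = 0.0

for t in range(HORIZON):
    # The environment description and summarized history live entirely
    # in the prompt; the model receives no other state.
    prompt = (
        f"You are choosing among {N_ARMS} slot machines to maximize total reward.\n"
        f"Round {t + 1} of {HORIZON}. Summary of past plays:\n"
        f"{summarize_history(counts, reward_sums)}\n"
        "Which arm do you pull next?"
    )
    arm = llm_choose_arm(prompt)
    reward = 1.0 if random.random() < TRUE_MEANS[arm] else 0.0
    counts[arm] += 1
    reward_sums[arm] += reward
    total_reward += reward

print(f"total reward over {HORIZON} rounds: {total_reward:.0f}")
```

The role of summarize_history here mirrors the paper's central finding: presenting per-arm statistics in the prompt, rather than the raw interaction history, was (together with chain-of-thought reasoning) the one condition under which GPT-4 explored satisfactorily.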