Efficient Exploration for LLMs


4 Jun 2024 | Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao and Benjamin Van Roy
The paper presents evidence of significant benefits from efficient exploration in gathering human feedback to improve large language models (LLMs). The authors conduct experiments in which an agent sequentially generates queries while fitting a reward model to the feedback it receives. The best-performing agent uses double Thompson sampling, with uncertainty represented by an epistemic neural network (ENN). The results demonstrate that efficient exploration enables high levels of performance with far fewer queries, and that both uncertainty estimation and the choice of exploration scheme play critical roles.

Large language models have shown remarkable capabilities after learning from vast amounts of text data. Reinforcement learning from human feedback (RLHF) further improves their behavior, even with only tens of thousands of interactions. The growing volume of human feedback from chatbots provides opportunities to gather much more data, which can lead to more sophisticated behavior. However, the process of learning from human feedback is challenging, and it remains unclear how to achieve superhuman ingenuity.

The paper introduces the concept of active exploration, which tailors interactions to elicit the most useful feedback. The authors compare passive exploration, in which the language model itself selects responses, against several active exploration algorithms: Boltzmann exploration, which favors responses with higher predicted rewards, and two ENN-based approaches, infomax, which aims to maximize information gain, and double Thompson sampling, which samples responses according to the probability that they are optimal.
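To make these selection rules concrete, here is a minimal sketch of Boltzmann exploration and double Thompson sampling over a set of candidate responses. It assumes a small ensemble of reward heads as a stand-in for the paper's epistemic neural network; the function names, shapes, and temperature are illustrative, not the authors' implementation. Infomax, the third active scheme, would instead pick the pair whose feedback is expected to yield the largest information gain; it is omitted to keep the sketch short.

```python
# Illustrative sketch only: an ensemble of reward heads stands in for the ENN,
# and all names, shapes, and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_pair(rewards, temperature=1.0):
    """Sample two distinct responses, favoring those with higher predicted reward.

    `rewards` holds the point-estimate reward for each candidate response.
    """
    logits = np.asarray(rewards) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(rewards), size=2, replace=False, p=probs)

def double_thompson_pair(ensemble_rewards, max_tries=20):
    """Double Thompson sampling: each response maximizes the reward under an
    independently sampled hypothesis (here, a random ensemble member), so each
    is selected roughly in proportion to its probability of being optimal.

    `ensemble_rewards` has shape (num_ensemble_members, num_candidates).
    """
    ensemble_rewards = np.asarray(ensemble_rewards)
    first = int(np.argmax(ensemble_rewards[rng.integers(len(ensemble_rewards))]))
    for _ in range(max_tries):
        second = int(np.argmax(ensemble_rewards[rng.integers(len(ensemble_rewards))]))
        if second != first:
            return first, second
    # Fall back to the best response under the ensemble mean if samples keep agreeing.
    ranked = np.argsort(-ensemble_rewards.mean(axis=0))
    return first, int(ranked[0] if ranked[0] != first else ranked[1])

# Toy usage: 8 candidate responses scored by a 10-member ensemble.
ensemble = rng.normal(size=(10, 8))
print("Boltzmann pair:", boltzmann_pair(ensemble.mean(axis=0)))
print("Double TS pair:", double_thompson_pair(ensemble))
```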
The experimentation setup consists of a learning pipeline and an assessment pipeline. The learning pipeline governs the interaction between the agent and a human feedback simulator, while the assessment pipeline evaluates the agent's performance against a human preference simulator; reward models guide response selection in both phases. The paper also describes the architecture and training of these reward models, covering both point-estimate models and ENN models.

The empirical results show that active exploration accelerates learning and achieves higher win rates, with double Thompson sampling emerging as the top performer. The quality of uncertainty estimates is assessed using dyadic joint negative log-loss (NLL, sketched below), and the results indicate that ENN models produce more meaningful uncertainty estimates than point-estimate models. The paper also examines how the rewards assigned to responses evolve over the course of learning, showing that double Thompson sampling converges on better responses than Boltzmann exploration.

The paper concludes by discussing future research directions, including improved ENN architectures, tuning of the LLM torso, and efficient exploration in multi-turn dialogues. The findings suggest that efficient exploration can significantly reduce the time and data required to achieve superhuman creativity in LLMs.
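As a closing illustration of the evaluation metric above, here is a minimal sketch of a dyadic joint NLL computation. It assumes preferences follow a Bradley-Terry model on reward differences and approximates the ENN with an ensemble whose members play the role of epistemic indices; these are illustrative assumptions, not the paper's exact evaluation code.

```python
# Illustrative sketch only: a Bradley-Terry preference model and an ensemble
# approximation of the ENN; names and shapes are hypothetical.
import numpy as np

def pref_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + np.exp(-(r_chosen - r_rejected)))

def dyadic_joint_nll(ens_chosen, ens_rejected):
    """Joint NLL of a dyad (two preference examples) under the ensemble.

    `ens_chosen` and `ens_rejected` have shape (num_ensemble_members, 2): the
    rewards each member assigns to the chosen / rejected response of the two
    examples in the dyad. The joint likelihood averages the product of the two
    preference probabilities over members, so the loss stays low unless the
    model is confidently wrong about both examples at once.
    """
    per_member = np.prod(pref_prob(np.asarray(ens_chosen), np.asarray(ens_rejected)), axis=1)
    return -np.log(per_member.mean() + 1e-12)

# Toy usage: a 10-member ensemble scoring one dyad of preference examples.
rng = np.random.default_rng(1)
chosen = rng.normal(loc=0.5, size=(10, 2))
rejected = rng.normal(loc=0.0, size=(10, 2))
print("Dyadic joint NLL:", dyadic_joint_nll(chosen, rejected))
```

Evaluating the loss jointly over dyads, rather than one example at a time, is what makes the metric sensitive to epistemic uncertainty: a model with reasonable marginal predictions but no sense of what it does not know is penalized for correlated errors across the pair.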