Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

29 May 2024 | Shenao Zhang¹, Donghan Yu², Hiteshi Sharma², Ziyi Yang², Shuohang Wang², Hany Hassan², Zhaoran Wang¹
This paper introduces Self-Exploring Language Models (SELM), an active preference elicitation method for the online alignment of large language models (LLMs). SELM incorporates an optimism term into the reward-fitting objective, balancing the exploitation of observed data against the exploration of potentially high-reward regions. Unlike standard online RLHF algorithms, which explore the response space only passively, SELM actively seeks out diverse, high-quality responses; this self-exploration mechanism mitigates the risk of premature convergence and of overfitting a reward model that is only locally accurate. To optimize the resulting bilevel objective, the paper solves the inner-level problem and reparameterizes the reward with the LLM policy, yielding a simple yet novel iterative alignment algorithm. Compared with Direct Preference Optimization (DPO), SELM improves exploration efficiency by selectively favoring responses with high potential reward rather than indiscriminately sampling unseen responses.

Experiments on Zephyr-7B-SFT and Llama-3-8B-Instruct demonstrate the efficacy of SELM: finetuned on the UltraFeedback dataset with PairRM providing AI feedback, SELM achieves substantial improvements on AlpacaEval 2.0, MT-Bench, and academic benchmarks. These results underscore SELM's ability to enhance the alignment and capabilities of LLMs by promoting more diverse and higher-quality responses.

The proposed technique is orthogonal to the specific online RLHF workflow adopted and can be integrated directly into recent online RLHF pipelines, with or without a separate reward model. Future work includes applying the method within more sophisticated alignment frameworks with advanced designs.
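As a rough illustration of the idea described above (a schematic, not the paper's exact formulation), the optimism-augmented bilevel objective can be sketched as follows, where a standard Bradley–Terry reward-fitting term is assumed, $\mathcal{D}$ is the preference dataset, $\sigma$ the sigmoid, $\pi_{\mathrm{ref}}$ the reference policy, and $\alpha, \beta > 0$ are trade-off coefficients:

\[
\max_{r}\;\; \mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\big(r(x,y_w)-r(x,y_l)\big)\right]
\;+\;\alpha\,\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right].
\]

The first term fits the reward to observed preferences; the second (optimism) term credits rewards that some policy could attain on as-yet-unseen responses. With the inner maximization KL-regularized toward $\pi_{\mathrm{ref}}$ (as is standard in RLHF) and the reward reparameterized through the policy via the familiar DPO identity

\[
r(x,y) \;=\; \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;+\; \beta\,\log Z(x),
\]

the explicit reward model drops out, leaving a DPO-style loss on $\pi_\theta$ plus an exploration bonus that is optimized iteratively; the precise form of that bonus is derived in the paper.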