29 May 2024 | Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
The paper introduces a novel method called Self-Exploring Language Models (SELM) for aligning Large Language Models (LLMs) with human intentions through active preference elicitation. Unlike traditional offline alignment methods, which rely on a fixed dataset, online alignment involves iterative feedback collection from humans or AI, leading to more capable reward models and better-aligned LLMs. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses across the vast space of natural language. Random sampling from standard reward-maximizing LLMs is insufficient for this purpose.
To address this issue, the authors propose a bilevel objective that is optimistically biased toward potentially high-reward responses, enabling active exploration of out-of-distribution regions. Solving the inner-level problem with the reparameterized reward yields SELM, an algorithm that eliminates the need for a separate reward model and iteratively updates the LLM with a simple objective. Compared to Direct Preference Optimization (DPO), SELM mitigates the indiscriminate favoring of unseen extrapolated responses and improves exploration efficiency.
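To make the idea concrete, here is a minimal PyTorch-style sketch of what such an objective could look like: the standard DPO loss on preference pairs, plus an optimism bonus on the reparameterized (implicit) reward. The function name `selm_style_loss`, the hyperparameters `beta` and `alpha`, and the choice to apply the bonus to the chosen response's implicit reward are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selm_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, alpha=0.01):
    """Illustrative SELM-style objective (assumed form, not the paper's exact loss):
    the standard DPO loss plus an optimism bonus on the implicit reward.

    Each *_logps tensor holds summed log-probabilities of full responses under
    the current policy (policy_*) or the frozen reference model (ref_*).
    `beta` is the DPO temperature; `alpha` weights the exploration term.
    """
    # Reparameterized (implicit) rewards, as in DPO: r(x, y) = beta * log(pi / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: prefer chosen over rejected responses.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Optimism bonus: bias the policy toward responses whose implicit reward
    # is potentially high, encouraging exploration beyond the current data.
    optimism_bonus = chosen_rewards.mean()

    return dpo_loss - alpha * optimism_bonus
```

In this sketch, setting `alpha = 0` recovers plain DPO, which makes the role of the added exploration term explicit: it nudges the policy toward high implicit-reward regions instead of only fitting the observed preference pairs.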
Experimental results demonstrate that SELM significantly boosts the performance of Zephyr-7B-SFT and Llama-3-8B-Instruct models on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks. The code and models are available at <https://github.com/shenao-zhang/SELM>.