Value Augmented Sampling for Language Model Alignment and Personalization
10 May 2024 | Seungwook Han, Idan Shenfeld, Akash Srivastava, Yoon Kim, Pulkit Agrawal
The paper introduces a new framework called Value Augmented Sampling (VAS) for optimizing reward functions in Large Language Models (LLMs). VAS aims to address the challenges of existing methods, such as high inference costs and optimization instability, by using data from the initial, frozen LLM to estimate the value function. This approach avoids the need for co-training the policy and value function, making the optimization process more stable and efficient. VAS outperforms established baselines like PPO and DPO on standard benchmarks while being significantly more computationally efficient. It also enables fine-grained control over the extent of reward optimization during deployment, allowing for personalized LLMs. Additionally, VAS can adapt black-box models without access to their weights, making it versatile for various applications. The paper demonstrates VAS's effectiveness through experiments on summarization and multi-turn chat dialogue tasks, showing superior performance and stability compared to competing methods.
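The core decoding idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the learned value function is already available as per-token scores `q_values` aligned with the base model's vocabulary, and it shows how those scores tilt the frozen model's next-token distribution. The scaling factor `beta` is the deployment-time knob mentioned above (`beta = 0` recovers the unmodified base model; larger values optimize the reward more aggressively). All names here are illustrative.

```python
import numpy as np

def vas_augmented_distribution(base_logits, q_values, beta=1.0):
    """Tilt the frozen base model's next-token distribution with
    value estimates: augmented_logits = base_logits + beta * Q.

    beta controls how strongly the reward is optimized; beta = 0
    leaves the base model's distribution unchanged.
    """
    augmented = np.asarray(base_logits, dtype=float) + beta * np.asarray(q_values, dtype=float)
    # Numerically stable softmax over the augmented logits.
    shifted = augmented - augmented.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy example: the base model prefers token 0, but the value
# function estimates that token 2 leads to higher reward.
base_logits = [2.0, 0.5, 0.1]
q_values = [0.0, 0.2, 3.0]

p_base = vas_augmented_distribution(base_logits, q_values, beta=0.0)
p_tilted = vas_augmented_distribution(base_logits, q_values, beta=2.0)
```

Because the base model is only queried for logits, this style of decoding also explains the black-box adaptation claim: no access to the model's weights is required, only to its per-token output scores.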