10 May 2024 | Seungwook Han, Idan Shenfeld, Akash Srivastava, Yoon Kim, Pulkit Agrawal
Value Augmented Sampling (VAS) is a framework for reward optimization in Large Language Models (LLMs) that maximizes different reward functions using only data sampled from an initial, frozen LLM. Unlike traditional reinforcement learning (RL) methods, which update the LLM's weights, VAS never needs access to them, so it can adapt models such as ChatGPT that are available only through APIs. VAS also enables the composition of multiple rewards and fine-grained control over how strongly each reward is applied at deployment time, paving the way for more personalized and aligned LLMs.
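As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the snippet below re-weights a frozen base model's top-k next-token candidates with a separately trained value model before sampling; `value_fn`, `beta`, and `k` are assumed, illustrative names and hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def vas_decode_step(base_model, value_fn, input_ids, k=20, beta=1.0):
    """Pick the next token by re-weighting the frozen LLM's top-k candidates
    with a learned value estimate; the base model's weights are never updated."""
    # input_ids: (1, seq_len) token ids of the prompt / partial generation.
    logits = base_model(input_ids).logits[:, -1, :]        # next-token logits from the frozen LLM
    logprobs = F.log_softmax(logits, dim=-1)
    top_logprobs, top_tokens = logprobs.topk(k, dim=-1)    # restrict scoring to the top-k candidates

    # Score each candidate continuation with the (hypothetical) value model,
    # assumed to return one scalar per candidate sequence.
    candidates = torch.cat([input_ids.repeat(k, 1), top_tokens.view(-1, 1)], dim=-1)
    values = value_fn(candidates)                          # shape: (k,)

    # Value-augmented distribution: base log-probability plus a beta-weighted
    # value bonus, renormalized over the k candidates.
    probs = F.softmax(top_logprobs.view(-1) + beta * values, dim=-1)
    next_token = top_tokens.view(-1)[torch.multinomial(probs, 1)]
    return torch.cat([input_ids, next_token.view(1, 1)], dim=-1)
```

Here `beta` acts as a deployment-time knob: lower values stay close to the frozen LLM, higher values lean harder on the reward.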
VAS outperforms established baselines such as PPO and DPO on standard benchmarks, matching Best-of-128 sampling at a significantly lower inference cost. It also lets LLMs be adapted to new user preferences without retraining and can be applied to black-box models. Because only a value model is trained, the method bypasses the bi-level optimization loop of actor-critic RL, leading to more stable and efficient optimization.
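For contrast with actor-critic training, here is a hedged sketch of what fitting such a value model could look like: a single supervised regression of observed reward onto completions sampled from the frozen LLM. The function and variable names are assumptions for illustration, not the authors' code.

```python
def value_regression_step(value_fn, optimizer, prefix_ids, observed_reward):
    """One supervised update: regress the value model's prediction for a prefix
    onto the reward actually obtained by the frozen LLM's own continuation."""
    pred = value_fn(prefix_ids)                      # scalar tensor: predicted reward-to-go
    loss = (pred - observed_reward).pow(2).mean()    # plain MSE; no alternating actor/critic updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```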
In experiments, VAS outperformed PPO and FUDGE on summarization and chat dialogue tasks in both achieved reward and win rate. It was also effective for multi-reward optimization, maximizing several alignment objectives without sacrificing performance on any single axis, and it enabled fine-grained control over response characteristics such as formality and verbosity, allowing for personalized outputs.
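To illustrate how reward composition and deployment-time control could look in this setup, the sketch below mixes several value models with user-chosen weights and plugs into the decoding sketch above; the model names and weights are hypothetical.

```python
def combined_value(value_fns, weights):
    """Mix several value models into one scorer with user-chosen, per-reward weights."""
    def value_fn(candidates):
        return sum(w * v(candidates) for w, v in zip(weights, value_fns))
    return value_fn

# Example with hypothetical value models: favor helpfulness and lightly reward
# brevity, then hand the mixed scorer to vas_decode_step. No retraining involved.
# mixed = combined_value([helpfulness_value, brevity_value], weights=[1.0, 0.3])
```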
VAS was also shown to be effective in teaching GPT-3.5 to use API tools, demonstrating its capability to adapt closed-source models. The method's efficiency and flexibility make it a promising approach for aligning and personalizing LLMs in real-world applications. However, challenges remain in improving computational efficiency and scalability, particularly in environments with large action spaces.