22 Jul 2024 | Kaiwen Wang†, Rahul Kidambi, Ryan Sullivan†, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard Leurent
The paper introduces Conditioned Language Policies (CLP), a general framework for multi-objective finetuning of language models. CLP addresses the challenge of developing steerable models that can trade off multiple conflicting objectives, such as creativity and safety, in a flexible and efficient manner. Unlike traditional single-objective finetuning, which requires retraining for each new objective, CLP uses multi-task training and parameter-efficient finetuning to learn models that can adapt to different reward weightings at inference time. The framework is evaluated through extensive experiments on summarization tasks, demonstrating that CLP outperforms existing state-of-the-art approaches in both output quality and steerability. The paper also provides theoretical insights, showing that zero-shot methods can be near-optimal under specific conditions but fail when the policies for individual rewards do not align. CLP is shown to robustly maintain its benefits across different experimental conditions, including varying reward functions and model sizes.
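To make the multi-task weighting idea concrete, below is a minimal, hypothetical sketch: at each training step a reward weighting is sampled from the simplex, the conflicting rewards are combined by linear scalarization, and a weight-conditioned policy would be updated against that single scalar. The reward functions (`reward_quality`, `reward_safety`), the weight sampling, and the placeholder generation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of multi-reward scalarization with a sampled weighting.
# All names here are illustrative stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def reward_quality(text: str) -> float:
    # Stand-in scorer: longer summaries score higher (purely illustrative).
    return min(len(text.split()) / 50.0, 1.0)

def reward_safety(text: str) -> float:
    # Stand-in scorer: penalize an illustrative "unsafe" marker token.
    return 0.0 if "UNSAFE" in text else 1.0

def sample_weights(k: int) -> np.ndarray:
    # Sample a point on the k-simplex so every trade-off is seen during training.
    return rng.dirichlet(np.ones(k))

def scalarize(rewards: np.ndarray, weights: np.ndarray) -> float:
    # Linear scalarization: one scalar training signal per sampled weighting.
    return float(np.dot(weights, rewards))

# One illustrative "training" step: generation would be conditioned on the
# sampled weights, scored under each reward, then scalarized for the update.
weights = sample_weights(2)
text = "A short illustrative summary of the source document."  # placeholder generation
rewards = np.array([reward_quality(text), reward_safety(text)])
print(f"weights={weights.round(3)}, scalarized reward={scalarize(rewards, weights):.3f}")
```

At inference time, steerability then amounts to passing the desired weighting to the conditioned policy instead of sampling it.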