Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning

22 Jul 2024 | Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard Leurent
This paper introduces Conditioned Language Policies (CLP), a general framework for multi-objective finetuning (MOFT) that enables language models (LMs) to adapt their outputs to different reward weightings without retraining. CLP combines multi-task learning with parameter-efficient finetuning to produce steerable models that can trade off conflicting objectives at inference time. Unlike prior approaches that train a separate model per weighting or rely solely on prompt-based conditioning, CLP learns a single model that is conditioned on a reward-weight vector and generates outputs that maximize the corresponding weighted combination of rewards.

The paper also provides a theoretical analysis showing that zero-shot methods can be near-optimal under certain conditions but fail otherwise, highlighting the necessity of multi-task training for achieving Pareto-optimal policies. Empirically, CLP is evaluated across a variety of tasks and reward functions and outperforms existing methods in both output quality and steerability while requiring fewer computational resources. The framework is robust across experimental conditions, including different reward functions and model sizes, and maintains its performance when combined with other conditioning mechanisms such as prompting. Overall, the results indicate that CLP offers a favorable trade-off between steerability and parameter efficiency, making it a promising approach for multi-objective finetuning.
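To make the training recipe the summary describes more concrete, here is a minimal, hypothetical sketch of a weight-conditioned, multi-task training loop: a reward weighting is sampled on the simplex at each step, the policy is conditioned on it, and the scalarized (weighted) reward is maximized. A toy categorical policy and a REINFORCE update stand in for the paper's language model and RLHF objective; the names `ToyConditionedPolicy`, `reward_short`, and `reward_detailed` are illustrative assumptions, not from the paper.

```python
# Sketch only: toy stand-in for weight-conditioned multi-objective finetuning.
import torch
import torch.nn as nn

NUM_ACTIONS = 8  # toy "response styles" the policy can choose among

def reward_short(action):     # objective 1: prefer terse outputs
    return 1.0 - action / (NUM_ACTIONS - 1)

def reward_detailed(action):  # objective 2: prefer detailed outputs
    return action / (NUM_ACTIONS - 1)

class ToyConditionedPolicy(nn.Module):
    """Policy whose action distribution is conditioned on a reward-weight vector w."""
    def __init__(self, num_actions=NUM_ACTIONS, num_rewards=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_rewards, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, w):  # w: (batch, num_rewards) on the simplex
        return torch.distributions.Categorical(logits=self.net(w))

policy = ToyConditionedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(2000):
    # Multi-task training: sample a fresh batch of reward weightings each step.
    w = torch.distributions.Dirichlet(torch.ones(2)).sample((64,))
    dist = policy(w)
    actions = dist.sample()
    rewards = torch.stack([
        torch.tensor([reward_short(a.item()), reward_detailed(a.item())])
        for a in actions
    ])
    scalarized = (w * rewards).sum(dim=-1)                 # weighted reward combination
    loss = -(dist.log_prob(actions) * scalarized).mean()   # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, steering amounts to changing the conditioning weights.
with torch.no_grad():
    for w in (torch.tensor([[0.9, 0.1]]), torch.tensor([[0.1, 0.9]])):
        probs = policy(w).probs.squeeze(0)
        print(w.tolist(), [round(p, 2) for p in probs.tolist()])
```

Shifting the conditioning vector from favoring `reward_short` to favoring `reward_detailed` moves the learned action distribution accordingly, which is the steerability property the single CLP-style model is meant to provide without retraining.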