22 Jul 2024 | Kaiwen Wang†, Rahul Kidambi, Ryan Sullivan†, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard Leurent
The paper introduces Conditioned Language Policies (CLP), a general framework for multi-objective finetuning of language models. CLP addresses the challenge of developing steerable models that can trade off multiple conflicting objectives, such as creativity and safety, in a flexible and efficient manner. Unlike traditional single-objective finetuning, which requires retraining for each new objective, CLP uses multi-task training and parameter-efficient finetuning to learn models that can adapt to different reward weightings at inference time. The framework is evaluated through extensive experiments on summarization tasks, demonstrating that CLP outperforms existing state-of-the-art approaches in both output quality and steerability. The paper also provides theoretical insights, showing that zero-shot methods can be near-optimal under specific conditions but fail when the policies for individual rewards do not align. CLP is shown to robustly maintain its benefits across different experimental conditions, including varying reward functions and model sizes.
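To make the multi-task weighting idea concrete, below is a minimal, hypothetical sketch: at each training step a reward weighting is sampled from the simplex, the conflicting rewards are combined by linear scalarization, and a weight-conditioned policy would be updated against that single scalar. The reward functions (`reward_quality`, `reward_safety`), the weight sampling, and the placeholder generation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of multi-reward scalarization with a sampled weighting.
# All names here are illustrative stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def reward_quality(text: str) -> float:
    # Stand-in scorer: longer summaries score higher (purely illustrative).
    return min(len(text.split()) / 50.0, 1.0)

def reward_safety(text: str) -> float:
    # Stand-in scorer: penalize an illustrative "unsafe" marker token.
    return 0.0 if "UNSAFE" in text else 1.0

def sample_weights(k: int) -> np.ndarray:
    # Sample a point on the k-simplex so every trade-off is seen during training.
    return rng.dirichlet(np.ones(k))

def scalarize(rewards: np.ndarray, weights: np.ndarray) -> float:
    # Linear scalarization: one scalar training signal per sampled weighting.
    return float(np.dot(weights, rewards))

# One illustrative "training" step: generation would be conditioned on the
# sampled weights, scored under each reward, then scalarized for the update.
weights = sample_weights(2)
text = "A short illustrative summary of the source document."  # placeholder generation
rewards = np.array([reward_quality(text), reward_safety(text)])
print(f"weights={weights.round(3)}, scalarized reward={scalarize(rewards, weights):.3f}")
```

At inference time, steerability then amounts to passing the desired weighting to the conditioned policy instead of sampling it.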