Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

29 Feb 2024 | Yiju Guo*, Ganqu Cui*, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
The paper "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment" addresses the challenge of aligning large language models (LLMs) with human preferences, particularly the "3H" (helpfulness, honesty, harmlessness) desiderata. The authors introduce Controllable Preference Optimization (CPO), a novel approach that explicitly specifies preference scores for different objectives, guiding the model to generate responses that meet these requirements. CPO consists of two stages: Controllable Preference Supervised Fine-Tuning (CPSFT) and Controllable Direct Preference Optimization (CDPO). CPSFT involves training the model using preference tokens to control specific preference conditions, while CDPO directly compares the human preference of given responses with a conditional multi-preference value, adjusting the probability of better responses. Experimental results on datasets like UltraFeedback and UltraSafety demonstrate that CPO achieves better controllability and performance in single objectives compared to existing methods, and it surpasses baseline methods in multi-objective alignment, achieving Pareto improvements. The study highlights the importance of explicit conditioning in multi-objective optimization and provides a practical solution to mitigate the "alignment tax" in LLMs.The paper "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment" addresses the challenge of aligning large language models (LLMs) with human preferences, particularly the "3H" (helpfulness, honesty, harmlessness) desiderata. The authors introduce Controllable Preference Optimization (CPO), a novel approach that explicitly specifies preference scores for different objectives, guiding the model to generate responses that meet these requirements. CPO consists of two stages: Controllable Preference Supervised Fine-Tuning (CPSFT) and Controllable Direct Preference Optimization (CDPO). CPSFT involves training the model using preference tokens to control specific preference conditions, while CDPO directly compares the human preference of given responses with a conditional multi-preference value, adjusting the probability of better responses. Experimental results on datasets like UltraFeedback and UltraSafety demonstrate that CPO achieves better controllability and performance in single objectives compared to existing methods, and it surpasses baseline methods in multi-objective alignment, achieving Pareto improvements. The study highlights the importance of explicit conditioning in multi-objective optimization and provides a practical solution to mitigate the "alignment tax" in LLMs.