29 Feb 2024 | Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
Controllable Preference Optimization (CPO) addresses the challenge of aligning large language models (LLMs) with multiple human preferences, such as helpfulness, honesty, and harmlessness, while mitigating the "alignment tax" (the trade-off where improving one preference can degrade another). Existing alignment methods push the model in a single preference direction, leading to suboptimal trade-offs. CPO instead introduces explicit preference conditions to guide LLMs, enabling controllable multi-objective alignment. The method consists of two stages: (1) Controllable Preference Supervised Fine-tuning (CPSFT), which incorporates preference tokens into the input prompt to guide model responses, and (2) Controllable Direct Preference Optimization (CDPO), which optimizes responses by comparing preference pairs under the specified multi-preference conditions.
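To make the two stages concrete, here is a minimal sketch of how preference conditioning might be implemented. This is not the authors' code: the "<Dimension: score>" token format, the score scale, and the function names are illustrative assumptions, and the CDPO term is shown as a standard DPO loss computed on responses to the preference-conditioned prompt.

```python
import torch
import torch.nn.functional as F


def add_preference_tokens(prompt: str, scores: dict) -> str:
    """CPSFT-style input construction: prepend preference control tokens
    (e.g. "<Helpfulness: 5>") to the user prompt so the model is told which
    objectives to prioritize. The exact token format is an assumption."""
    condition = " ".join(f"<{dim}: {score}>" for dim, score in scores.items())
    return f"{condition} {prompt}"


def cdpo_loss(policy_chosen_logps: torch.Tensor,
              policy_rejected_logps: torch.Tensor,
              ref_chosen_logps: torch.Tensor,
              ref_rejected_logps: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """DPO-style objective; in the CDPO stage the log-probabilities are
    computed on responses to the preference-conditioned prompt, so "chosen"
    means the response preferred under the stated preference condition."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Example: condition the model on maximal helpfulness, honesty, and harmlessness.
conditioned_prompt = add_preference_tokens(
    "How do I treat a minor burn at home?",
    {"Helpfulness": 5, "Honesty": 5, "Harmlessness": 5},
)
print(conditioned_prompt)
```

At inference time, adjusting the scores in the condition is what makes the alignment controllable: the same model can be steered toward different trade-offs among the objectives.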
Experiments show that CPO achieves better controllability and stronger multi-objective performance than existing methods such as DPO and SFT, aligning more effectively with helpfulness, honesty, and harmlessness and reducing conflicts between alignment objectives. Evaluated on the UltraFeedback and UltraSafety datasets, CPO maintains high performance on all three objectives, including safety, and achieves Pareto improvements in multi-objective alignment.
The study highlights the importance of controllability in multi-objective alignment, arguing that it is not possible to please all users all the time. By explicitly grounding LLMs with preference conditions, CPO enables more precise alignment with specific user preferences. The method is effective in reducing the alignment tax and improving the overall performance of LLMs in aligning with human values. The results demonstrate that CPO is a promising approach for achieving controllable and effective multi-objective alignment in LLMs.