**MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models**
**Authors:** Kailai Yang et al.
**Affiliations:** The University of Manchester, The Fin AI
**Contact Information:** {kailai.yang, zhiwei.liu, sophia.ananiadou}@manchester.ac.uk, {xqq.sincere, zhangtianlin668}@gmail.com, jimin@chancefocus.com
**Abstract:**
Recent advancements in large language models (LLMs) focus on aligning them with human expectations and values through multi-objective preference alignment. However, existing methods depend on the parameters of the policy model, leading to high costs and limited adaptability to new objectives. This work introduces the Meta-Objective Aligner (MetaAligner), a policy-agnostic and generalizable method for multi-objective preference alignment. MetaAligner performs conditional weak-to-strong correction of weak responses, enabling plug-and-play alignment of any policy model and zero-shot preference alignment for unseen objectives. Experimental results show that MetaAligner achieves significant improvements across 10 state-of-the-art policy models while requiring up to 15.71× fewer GPU training hours than previous methods. MetaAligner also aligns unseen objectives effectively, marking a significant step towards generalizable multi-objective preference alignment.
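To make the plug-and-play idea concrete, below is a minimal sketch of how a trained MetaAligner-style model could be applied at inference time to correct a policy model's weak response, conditioned on text descriptions of the target objectives. The checkpoint path, prompt template, and objective wordings are illustrative assumptions, not the paper's exact artifacts.

```python
# Minimal sketch of plug-and-play correction with a MetaAligner-style model.
# The prompt template and objective descriptions are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNER_NAME = "path/to/metaaligner-checkpoint"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ALIGNER_NAME)
aligner = AutoModelForCausalLM.from_pretrained(ALIGNER_NAME)

# Objectives are specified purely as text descriptions, so the set can be
# changed at inference time without touching the policy model or the aligner.
OBJECTIVES = {
    "harmlessness": "the response should avoid harmful or unsafe content",
    "helpfulness": "the response should directly and usefully answer the query",
}

def correct_response(query: str, weak_response: str, objectives: dict) -> str:
    """Rewrite a policy model's weak response so it better satisfies the objectives."""
    objective_text = "; ".join(f"{k}: {v}" for k, v in objectives.items())
    prompt = (
        f"Consider the following objectives: {objective_text}\n"
        f"Query: {query}\n"
        f"Original response: {weak_response}\n"
        f"Rewrite the response so it better satisfies the objectives:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = aligner.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the corrected response.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Because the aligner only consumes the policy model's text output, it can be stacked on top of closed-API models such as GPT-3.5 or Claude-3 without access to their parameters.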
**Contributions:**
1. **MetaAligner:** The first policy-agnostic and generalizable method for multi-objective preference alignment, achieving efficient and stable alignment without tuning policy models.
2. **Zero-shot Preference Alignment:** Successful alignment of unseen objectives, demonstrating the potential for generalizable multi-objective preference alignment (a zero-shot usage sketch follows this list).
3. **Performance on Various Models:** Substantial improvements on multiple datasets and policy models, including large-scale models like GPT-3.5 and Claude-3.
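Building on the sketch after the abstract, zero-shot alignment of an unseen objective amounts to adding one more text description to the objective set and reusing the same correction call, with no retraining. The `empathy` objective and its wording below are illustrative assumptions.

```python
# Zero-shot alignment of an unseen objective: extend the objective descriptions
# with plain text and reuse correct_response() from the sketch above.
unseen_objectives = dict(OBJECTIVES)
unseen_objectives["empathy"] = "the response should acknowledge the user's feelings"

corrected = correct_response(
    query="I failed my exam and feel terrible.",
    weak_response="Study harder next time.",
    objectives=unseen_objectives,
)
print(corrected)
```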
**Experiments:**
- **Datasets:** HH-RLHF, UltraFeedback, IMHI.
- **Models:** TinyLLaMA-1.1B, LLaMA2-(7B, 13B, 70B), Vicuna-(7B, 13B, 33B), Gemma-instruct-(2B, 7B), MentaLLaMA-(7B, 13B, 33B), GPT-3.5, Claude-3.
- **Evaluation Metrics:** Win rates against ground-truth responses (a win-rate tabulation sketch follows this list).
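As a rough illustration of the evaluation protocol, the sketch below tabulates a win rate of corrected responses against ground-truth responses. The pairwise `judge` function is an assumed component (e.g., an LLM-based comparator), and counting ties as half a win is one common convention rather than the paper's specification.

```python
# Sketch of tabulating win rate against ground-truth responses.
# `judge` is an assumed pairwise comparator returning "win", "tie", or "lose"
# for the corrected response versus the ground-truth response.
def win_rate(pairs, judge):
    wins = ties = 0
    for corrected, ground_truth in pairs:
        verdict = judge(corrected, ground_truth)
        wins += verdict == "win"
        ties += verdict == "tie"
    # Common convention (an assumption here): count ties as half a win.
    return (wins + 0.5 * ties) / len(pairs)
```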
**Results:**
- **Efficiency:** MetaAligner achieves significant improvements with fewer GPU training hours than previous methods.
- **Generalizability:** Effective alignment of unseen objectives, maintaining performance on aligned objectives.
- **Scalability:** Alignment quality scales with the size of the MetaAligner model, with larger aligners producing better-aligned outputs.
**Limitations and Future Work:**
- **Computational Burden:** Inference incurs additional computational cost because the correction stage runs on top of the policy model's output.
- **Generalizability:** Generalization has so far been tested on only a limited set of unseen objectives.
- **Future Research:** Exploring domain-specific alignment scenarios, evaluating scalability, and expanding the landscape of generalizable multi-objective preference alignment.