**MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models**
**Authors:** Kailai Yang et al.
**Affiliations:** The University of Manchester, The Fin AI
**Contact Information:** {kailai.yang, zhiwei.liu, sophia.ananiadou}@manchester.ac.uk, {xqq.sincere, zhangtianlin668}@gmail.com, jimin@chancefocus.com
**Abstract:**
Recent advancements in large language models (LLMs) focus on aligning them with human expectations and values through multi-objective preference alignment. However, existing methods depend on the parameters of the policy model, leading to high costs and limited adaptability to new objectives. This work introduces the Meta-Objective Aligner (MetaAligner), a policy-agnostic and generalizable method for multi-objective preference alignment. MetaAligner performs conditional weak-to-strong correction of weak responses, enabling plug-and-play alignment of any policy model and zero-shot preference alignment for unseen objectives. Experimental results show that MetaAligner achieves significant improvements across 10 state-of-the-art policy models while requiring up to 15.71× fewer GPU training hours than previous methods. MetaAligner also aligns unseen objectives effectively, marking a significant step towards generalizable multi-objective preference alignment.
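To make the plug-and-play idea concrete, below is a minimal sketch of how a trained MetaAligner-style model could be applied at inference time to correct a policy model's weak response, conditioned on text descriptions of the target objectives. The checkpoint path, prompt template, and objective wordings are illustrative assumptions, not the paper's exact artifacts.

```python
# Minimal sketch of plug-and-play correction with a MetaAligner-style model.
# The prompt template and objective descriptions are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNER_NAME = "path/to/metaaligner-checkpoint"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(ALIGNER_NAME)
aligner = AutoModelForCausalLM.from_pretrained(ALIGNER_NAME)

# Objectives are specified purely as text descriptions, so the set can be
# changed at inference time without touching the policy model or the aligner.
OBJECTIVES = {
    "harmlessness": "the response should avoid harmful or unsafe content",
    "helpfulness": "the response should directly and usefully answer the query",
}

def correct_response(query: str, weak_response: str, objectives: dict) -> str:
    """Rewrite a policy model's weak response so it better satisfies the objectives."""
    objective_text = "; ".join(f"{k}: {v}" for k, v in objectives.items())
    prompt = (
        f"Consider the following objectives: {objective_text}\n"
        f"Query: {query}\n"
        f"Original response: {weak_response}\n"
        f"Rewrite the response so it better satisfies the objectives:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = aligner.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the corrected response.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Because the aligner only consumes the policy model's text output, it can be stacked on top of closed-API models such as GPT-3.5 or Claude-3 without access to their parameters.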
**Contributions:**
1. **MetaAligner:** The first policy-agnostic and generalizable method for multi-objective preference alignment, achieving efficient and stable alignment without tuning policy models.
2. **Zero-shot Preference Alignment:** Successful alignment of unseen objectives, demonstrating the potential for generalizable multi-objective preference alignment (a zero-shot usage sketch follows this list).
3. **Performance on Various Models:** Substantial improvements on multiple datasets and policy models, including large-scale models like GPT-3.5 and Claude-3.
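Building on the sketch after the abstract, zero-shot alignment of an unseen objective amounts to adding one more text description to the objective set and reusing the same correction call, with no retraining. The `empathy` objective and its wording below are illustrative assumptions.

```python
# Zero-shot alignment of an unseen objective: extend the objective descriptions
# with plain text and reuse correct_response() from the sketch above.
unseen_objectives = dict(OBJECTIVES)
unseen_objectives["empathy"] = "the response should acknowledge the user's feelings"

corrected = correct_response(
    query="I failed my exam and feel terrible.",
    weak_response="Study harder next time.",
    objectives=unseen_objectives,
)
print(corrected)
```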
**Experiments:**
- **Datasets:** HH-RLHF, UltraFeedback, IMHI.
- **Models:** TinyLLaMA-1.1B, LLaMA2-(7B, 13B, 70B), Vicuna-(7B, 13B, 33B), Gemma-instruct-(2B, 7B), MentaLLaMA-(7B, 13B, 33B), GPT-3.5, Claude-3.
- **Evaluation Metrics:** Win rates against ground-truth responses (a win-rate tabulation sketch follows this list).
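As a rough illustration of the evaluation protocol, the sketch below tabulates a win rate of corrected responses against ground-truth responses. The pairwise `judge` function is an assumed component (e.g., an LLM-based comparator), and counting ties as half a win is one common convention rather than the paper's specification.

```python
# Sketch of tabulating win rate against ground-truth responses.
# `judge` is an assumed pairwise comparator returning "win", "tie", or "lose"
# for the corrected response versus the ground-truth response.
def win_rate(pairs, judge):
    wins = ties = 0
    for corrected, ground_truth in pairs:
        verdict = judge(corrected, ground_truth)
        wins += verdict == "win"
        ties += verdict == "tie"
    # Common convention (an assumption here): count ties as half a win.
    return (wins + 0.5 * ties) / len(pairs)
```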
**Results:**
- **Efficiency:** MetaAligner achieves significant improvements with fewer GPU training hours than previous methods.
- **Generalizability:** Effective alignment of unseen objectives, maintaining performance on aligned objectives.
- **Scalability:** Alignment quality scales with the size of the MetaAligner model, with larger aligners producing better-aligned outputs.
**Limitations and Future Work:**
- **Computational Burden:** Inference incurs additional computational cost because the correction stage runs on top of the policy model's output.
- **Generalizability:** Generalization has so far been tested on only a limited set of unseen objectives.
- **Future Research:** Exploring domain-specific alignment scenarios, evaluating scalability, and expanding the landscape of generalizable multi-objective preference alignment.