19 Nov 2024 | Zhanhui Zhou*, Zhixuan Liu*, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
The paper introduces a novel method called *weak-to-strong search* for aligning large language models (LLMs) with human preferences. The method frames alignment as a test-time greedy search that maximizes the log-probability difference between a small tuned model and its untuned counterpart while sampling from the frozen large model (a code sketch of this search follows the list below). The key contributions are:
1. **Model Up-Scaling Strategy**: It offers a compute-efficient approach to aligning large models without directly fine-tuning them, by leveraging small models as steering forces.
2. **Weak-to-Strong Generalization**: It enhances strong models with weak test-time guidance, demonstrating the potential for weak-to-strong generalization.
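To make the search concrete, here is a minimal sketch of the greedy, chunk-level variant, assuming a Hugging Face `transformers` setup. The model names, chunk length, and candidate count are illustrative assumptions, not the paper's exact configuration: candidate chunks are sampled from the frozen large model, and each is scored by the log-probability difference between the small tuned and untuned models.

```python
# Minimal sketch of weak-to-strong greedy search (assumptions: Hugging Face
# transformers API; chunk/candidate sizes are illustrative, not the paper's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, input_ids, prompt_len):
    """Sum of token log-probs for the continuation after the prompt."""
    with torch.no_grad():
        logits = model(input_ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum(dim=-1)

def weak_to_strong_search(prompt, large, tuned, untuned, tok,
                          chunk_len=16, num_candidates=4, max_chunks=8):
    """Greedy chunk-level search: sample chunks from the frozen large model,
    keep the chunk maximizing log pi_tuned - log pi_untuned so far."""
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    for _ in range(max_chunks):
        # Sample candidate continuations from the frozen large model.
        cands = large.generate(ids, do_sample=True, top_p=0.9,
                               max_new_tokens=chunk_len,
                               num_return_sequences=num_candidates,
                               pad_token_id=tok.eos_token_id)
        # Score candidates by the small models' log-probability difference.
        scores = (sequence_logprob(tuned, cands, prompt_len)
                  - sequence_logprob(untuned, cands, prompt_len))
        ids = cands[scores.argmax()].unsqueeze(0)
        if ids[0, -1].item() == tok.eos_token_id:
            break
    return tok.decode(ids[0, prompt_len:], skip_special_tokens=True)

# Example wiring (hypothetical model names):
# tok = AutoTokenizer.from_pretrained("large-base")
# large = AutoModelForCausalLM.from_pretrained("large-base")
# tuned, untuned = (AutoModelForCausalLM.from_pretrained(m)
#                   for m in ("small-tuned", "small-base"))
# print(weak_to_strong_search("Write a positive review:", large, tuned, untuned, tok))
```

The paper's chunk-level beam search generalizes this by keeping several hypotheses per step; the sketch above shows only the greedy special case described in the summary.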
The method is evaluated on various tasks, including controlled sentiment generation, summarization, and instruction-following. Empirical results show that weak-to-strong search effectively improves the alignment of large models using only test-time guidance from small models. In particular, it outperforms other test-time approaches on controlled sentiment generation and summarization, and consistently enhances the performance of large instruction-tuned models on the AlpacaEval 2.0 benchmark. The paper also discusses limitations and future directions, including the potential for applying weak-to-strong search to tasks beyond human preference alignment, such as reasoning and coding.