Weak-to-Strong Reasoning

18 Jul 2024 | Yuqing Yang, Yan Ma, Pengfei Liu
This paper introduces a progressive learning framework for weak-to-strong reasoning that enables a strong model to autonomously refine its own training data, without input from more advanced models or human-annotated data. The framework begins with supervised fine-tuning on a small but high-quality, selectively curated dataset, followed by preference optimization on contrastive samples identified by the strong model itself. The method is validated on the GSM8K and MATH datasets, where it substantially improves the reasoning capabilities of Llama2-70b under supervision from three separate weak models, and further on the OlympicArena dataset, where Llama3-8b-instruct effectively supervises Llama3-70b. The approach outperforms full weak fine-tuning, achieving a 26.99-point improvement on GSM8K, with an additional 8.49 points gained through preference optimization. It also enables the strong model to learn from the weak supervisor's errors, surpassing a strong model fine-tuned on gold-standard solutions in challenging scenarios. The framework remains effective in settings closer to anticipated future conditions, demonstrating robustness and generalizability, and points toward a more scalable strategy for enhancing AI reasoning.
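To make the two-stage pipeline concrete, here is a minimal, hypothetical Python sketch of the data-curation idea: the strong model keeps weak-model solutions whose final answers agree with its own for the stage-1 SFT set, then pairs agreeing and disagreeing solutions as contrastive data for stage-2 preference optimization. The names (`Example`, `curate`, `strong_answer`) and the agreement criterion are illustrative assumptions, not the paper's exact procedure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    question: str
    weak_solution: str   # chain-of-thought produced by the weak model
    weak_answer: str     # final answer extracted from that solution

def curate(examples: list[Example],
           strong_answer: Callable[[str], str],
           ) -> tuple[list[Example], list[tuple[Example, Example]]]:
    """Hypothetical sketch: the strong model filters and contrasts weak data."""
    sft_set: list[Example] = []
    preference_pairs: list[tuple[Example, Example]] = []
    by_question: dict[str, list[tuple[Example, bool]]] = {}

    for ex in examples:
        # Assumed criterion: trust a weak solution if its final answer matches
        # the strong model's own sampled answer for the same question.
        agrees = strong_answer(ex.question) == ex.weak_answer
        by_question.setdefault(ex.question, []).append((ex, agrees))
        if agrees:
            sft_set.append(ex)  # stage 1: small, higher-quality SFT subset

    for variants in by_question.values():
        good = [e for e, ok in variants if ok]
        bad = [e for e, ok in variants if not ok]
        # Stage 2: (chosen, rejected) pairs for preference optimization,
        # letting the strong model learn from the weak supervisor's errors.
        preference_pairs.extend((g, b) for g in good for b in bad)

    return sft_set, preference_pairs
```

In this sketch, `strong_answer` stands in for sampling the strong model's own answer; the resulting SFT subset and contrastive pairs would then feed a standard fine-tuning step and a DPO-style preference-optimization step, respectively.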