Weak-to-Strong Reasoning

18 Jul 2024 | Yuqing Yang, Yan Ma, Pengfei Liu
This paper explores the effectiveness of weak-to-strong learning in enhancing the reasoning capabilities of large language models (LLMs). Weak-to-strong learning leverages a less capable model to unlock the latent abilities of a stronger model, which is particularly valuable when full, accurate supervision for complex reasoning tasks is difficult to obtain. The authors introduce a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from more advanced models or human-annotated data. The framework consists of two stages: supervised fine-tuning on a small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Extensive experiments on the GSM8K and MATH datasets demonstrate that this method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models. The method is further validated on the highly challenging OlympicArena dataset, where Llama3-8b-instruct effectively supervises Llama3-70b. This work paves the way for more scalable and sophisticated strategies to enhance AI reasoning capabilities.
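To make the two-stage pipeline concrete, here is a minimal, illustrative sketch of the data-construction side of both stages. It is not the authors' released code: it assumes GSM8K-style solutions ending in `#### <answer>` and a model object exposing a hypothetical `generate(prompt) -> str` method, and it simplifies the paper's selection criteria to a plain answer-consistency filter (Stage 1) and majority-vote contrastive pairing (Stage 2).

```python
from collections import Counter

def final_answer(solution: str) -> str:
    """Extract the final answer; assumes GSM8K-style solutions ending in '#### <answer>'."""
    return solution.rsplit("####", 1)[-1].strip()

def filter_weak_data(strong_model, weak_data):
    """Stage 1 (simplified): keep only weak-model demonstrations whose final
    answer the strong model independently reproduces -- a consistency filter
    that yields a small but higher-quality SFT dataset."""
    return [
        ex for ex in weak_data
        if final_answer(strong_model.generate(ex["question"])) == final_answer(ex["solution"])
    ]

def build_preference_pairs(sft_model, questions, n_samples=8):
    """Stage 2 (simplified): sample several solutions per question from the
    fine-tuned strong model; treat majority-vote answers as 'chosen' and
    dissenting ones as 'rejected' to form contrastive preference pairs."""
    pairs = []
    for q in questions:
        samples = [sft_model.generate(q) for _ in range(n_samples)]
        majority, _ = Counter(final_answer(s) for s in samples).most_common(1)[0]
        chosen = [s for s in samples if final_answer(s) == majority]
        rejected = [s for s in samples if final_answer(s) != majority]
        pairs.extend((q, c, r) for c in chosen for r in rejected)
    return pairs
```

The resulting (question, chosen, rejected) triples would then feed a DPO-style preference-optimization trainer in the second stage; the paper's actual selection heuristics are more involved than this majority-vote simplification.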