2 Apr 2024 | Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
The paper introduces Autonomous Data Selection (AutoDS), a strategy for improving the mathematical reasoning of language models through continual pretraining. AutoDS uses base language models with zero-shot meta-prompts to autonomously evaluate and select high-quality mathematical content. Rather than relying on supervised fine-tuning or classifiers trained on human annotations, AutoDS uses a score function derived from logits to assess the educational value of content, enabling a more fine-grained evaluation. The method is evaluated on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks, demonstrating significant improvements in downstream performance with a 2 times increase in pretraining token efficiency compared to state-of-the-art baselines. The AutoMathText dataset, curated for this purpose, is made available to the community. The paper also discusses the challenges and limitations of current approaches and highlights the potential of AutoDS in advancing AI systems' capabilities in specialized knowledge domains.
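To make the idea of a logit-derived score concrete, here is a minimal sketch in Python. The summary above says only that the score is derived from logits; the specific form used here (a softmax over the logits of a "YES" and a "NO" answer token), the function names `lm_score` and `select_documents`, the threshold of 0.6, and the example logits are all illustrative assumptions, not details confirmed by this summary.

```python
import math


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax: share of probability on "YES" versus "NO".

    Assumed form of the score; the summary only states that the
    score function is derived from logits.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(scored_docs, threshold=0.6):
    """Keep documents whose score is at least `threshold` (hypothetical cutoff)."""
    return [
        doc
        for doc, (logit_yes, logit_no) in scored_docs
        if lm_score(logit_yes, logit_no) >= threshold
    ]


# Hypothetical logits, as a base model might produce for the answer token
# after a zero-shot meta-prompt asking whether the text has educational value.
candidates = [
    ("Proof that sqrt(2) is irrational ...", (3.1, 0.4)),
    ("Buy cheap calculators now!!!", (-1.2, 2.5)),
]
kept = select_documents(candidates)
```

Because the score is a real number between 0 and 1 rather than a hard yes/no label, the cutoff can be tuned to trade dataset size against quality, which is one way to read the "fine-grained evaluation" described above.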