2 Apr 2024 | Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
This paper introduces Autonomous Data Selection (AutoDS), a method for improving the mathematical reasoning capabilities of language models through better pretraining data. AutoDS leverages a base language model with zero-shot meta-prompts to autonomously evaluate and select high-quality mathematical content: the logits of specific answer tokens are turned into a quantitative score function, so the model itself judges how informative each document is. Because the score is continuous rather than a binary keep-or-discard label, it supports a more refined selection strategy than conventional binary filtering, and it removes the need for human-annotated data.
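The summary does not spell out the exact scoring formula, but the mechanism it describes, reading the logits of specific answer tokens under a zero-shot meta-prompt, can be sketched as follows. The model name, the meta-prompt wording, and the choice of "YES"/"NO" as answer tokens are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base model for illustration; the paper's choice may differ.
MODEL_NAME = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical zero-shot meta-prompt; the original wording is not given in this summary.
META_PROMPT = (
    "You help identify high-quality mathematical content.\n"
    "Does the following text contain valuable mathematical reasoning or exposition? "
    "Answer with YES or NO.\n\nText:\n{text}\n\nAnswer: "
)

def lm_score(text: str) -> float:
    """Score a document from the next-token logits of 'YES' vs. 'NO'."""
    prompt = META_PROMPT.format(text=text)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    inputs = inputs.to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the answer position
    # Take the first sub-token of each answer word; multi-token answers would need extra handling.
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    # Softmax restricted to the two answer tokens yields a score in (0, 1).
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()
```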
Applying AutoDS at scale yields AutoMathText, an open-source dataset designed to enrich model training with mathematical content and to address the shortage of labeled, high-quality mathematical training resources. Continual pretraining of a 7B-parameter language model on AutoMathText roughly doubles pretraining token efficiency relative to state-of-the-art baselines and produces substantial downstream gains on MATH, GSM8K, and BIG-Bench Hard, with improvements also reported in other cognitive domains such as commonsense reasoning, world knowledge, and reading comprehension. By providing a scalable, objective mechanism for content assessment without manual annotation, AutoDS offers a practical route to curating the specialized pretraining corpora that mathematical reasoning demands, while substantially reducing the cost of data curation.
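Because the score is a probability-like value rather than a hard label, selection can go beyond binary filtering, for example by ranking documents and keeping only the highest-scoring ones, or by weighting them during pretraining. A minimal usage sketch building on the `lm_score` function above (the example documents and the 0.6 threshold are placeholders, not values from the paper):

```python
# Rank candidate documents by lm_score and keep the most mathematical ones.
documents = [
    "Let f(x) = x^2 + 1. We show that f has no real roots ...",
    "Limited-time offer on designer watches ...",
]
scored = sorted(((lm_score(doc), doc) for doc in documents), reverse=True)
selected = [doc for score, doc in scored if score > 0.6]  # placeholder threshold
```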