10 Jun 2024 | Zichun Yu, Spandan Das, Chenyan Xiong
This paper introduces MATES, a model-aware data selection framework for efficient language model pretraining. MATES adapts to the evolving data preferences of the pretraining model by fine-tuning a small data influence model to approximate oracle data preference signals, which are collected by locally probing the pretraining model, and then using that influence model to select the most effective data for each pretraining stage. Experiments with Pythia models on the C4 dataset show that MATES significantly outperforms random data selection in both zero- and few-shot settings across nine downstream tasks, and it doubles the gains of recent data selection approaches that rely on larger reference models. MATES also elevates the scaling curves of pretraining models, cutting the FLOPs and pretraining steps needed to reach a given downstream performance by more than half. Further analysis confirms the necessity of model-aware data selection, the effectiveness of locally probed oracle data influence, and the design choices behind the data influence model. The framework is open-sourced and points to a new direction for future research on data selection for pretraining.
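To make the described workflow concrete, below is a minimal, illustrative sketch of a MATES-style selection loop, not the authors' actual implementation. All names and interfaces (probe_oracle_influence, DataInfluenceModel, model.pretrain, model.reference_loss, and the stage/probe/select sizes) are hypothetical placeholders assumed for illustration.

```python
# Hypothetical sketch of a MATES-style model-aware data selection loop.
# Interfaces such as model.clone(), model.train_step(), model.reference_loss(),
# and influence_model.fit()/predict() are assumed placeholders, not a real API.

import random


def probe_oracle_influence(model, candidate, reference_batch):
    """Locally probe the pretraining model: take one training step on `candidate`
    and measure the change in loss on a held-out reference batch."""
    loss_before = model.reference_loss(reference_batch)
    probe = model.clone()              # probe a copy so the main model is untouched
    probe.train_step(candidate)
    loss_after = probe.reference_loss(reference_batch)
    return loss_before - loss_after    # positive value => candidate is helpful


def mates_pretraining(model, data_pool, influence_model, reference_batch,
                      num_stages=10, probes_per_stage=1_000, select_k=50_000):
    """Alternate between refreshing the data influence model and pretraining."""
    for stage in range(num_stages):
        # 1) Collect oracle influence scores on a small probe subset of the pool.
        probe_set = random.sample(data_pool, probes_per_stage)
        oracle_pairs = [(x, probe_oracle_influence(model, x, reference_batch))
                        for x in probe_set]

        # 2) Fine-tune the small data influence model to approximate the oracle,
        #    so the current model's data preferences are captured cheaply.
        influence_model.fit(oracle_pairs)

        # 3) Score the full pool with the influence model and keep the top-k examples.
        selected = sorted(data_pool, key=influence_model.predict, reverse=True)[:select_k]

        # 4) Continue pretraining on the selected data, then repeat so selection
        #    tracks the model's evolving preferences.
        model.pretrain(selected)
    return model
```

The key design point this sketch tries to capture is that the expensive oracle (one probing step per candidate) is only run on a small sample each stage, while the cheap influence model scores the entire pool, which is what makes model-aware selection tractable at pretraining scale.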