10 Jun 2024 | Zichun Yu, Spandan Das, Chenyan Xiong
The paper introduces MATES (Model-Aware Data Selection with Data Influence Models), a novel framework for enhancing the efficiency and effectiveness of language model pretraining through model-aware data selection. MATES uses a small data influence model to continuously adapt to the evolving data preferences of the main pretraining model, selecting the most effective training data for each stage of pretraining. The key contributions include:
1. **Model-Aware Data Selection Framework**: MATES interleaves data selection with model pretraining, re-selecting data at each stage as the main model evolves, with the aim of maximizing final target performance.
2. **Oracle Data Influence Collection**: The framework locally probes the oracle data influence by briefly training the main model on an individual data point and measuring the resulting change in its performance on a reference task (see the first sketch after this list).
3. **Data Influence Model Training**: A small BERT-based model is fine-tuned to approximate the oracle data influence and is then used to select the most effective data for the next pretraining stage (see the second sketch after this list).
4. **Empirical Validation**: Experiments pretraining Pythia models on the C4 dataset show that MATES substantially outperforms both random data selection and recent selection approaches that rely on larger reference models, achieving better zero-shot and few-shot performance across a range of downstream tasks.
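To make the local probing concrete, here is a minimal sketch, assuming a PyTorch / Hugging Face-style main model whose forward pass returns a `.loss` when the batch includes labels; the one-step SGD probe, the function name `probe_oracle_influence`, and the batch variables are illustrative assumptions, not the paper's released code.

```python
import copy
import torch

def reference_loss(model, reference_batch):
    """Loss of the model on a held-out reference batch (labels included)."""
    model.eval()
    with torch.no_grad():
        return model(**reference_batch).loss.item()

def probe_oracle_influence(main_model, candidate_batch, reference_batch, lr=1e-4):
    """Locally probe the oracle influence of one candidate data point:
    take a single optimizer step on it, then measure how much the
    reference loss drops. A positive return value means a helpful point."""
    # Probe a throwaway copy so the real pretraining state is untouched.
    probe = copy.deepcopy(main_model)
    before = reference_loss(probe, reference_batch)

    # One-step update on the candidate data point (batch includes labels).
    probe.train()
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr)
    optimizer.zero_grad()
    probe(**candidate_batch).loss.backward()
    optimizer.step()

    after = reference_loss(probe, reference_batch)
    return before - after  # the (signed) drop in reference loss
```

Probing a deep copy keeps the throwaway step from perturbing the actual pretraining state; in practice this probe would be repeated over many candidate points at each checkpoint to build the influence model's training set.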
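The data influence model admits a similarly compact sketch. The example below uses the Hugging Face `transformers` regression head (`num_labels=1` with `problem_type="regression"`, which yields an MSE loss); the per-example loops and the helper names `fit_influence_model` and `select_top_k` are illustrative simplifications rather than the paper's implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Small BERT-based regressor mapping raw text to a predicted influence score.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def fit_influence_model(texts, influences, epochs=1):
    """Fine-tune the regressor on (text, oracle influence) pairs."""
    model.train()
    for _ in range(epochs):
        for text, score in zip(texts, influences):
            batch = tokenizer(text, truncation=True, return_tensors="pt")
            batch["labels"] = torch.tensor([score], dtype=torch.float)
            loss = model(**batch).loss  # MSE loss under problem_type="regression"
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def select_top_k(candidate_pool, k):
    """Score every candidate text and keep the k predicted-most-influential."""
    model.eval()
    scores = []
    with torch.no_grad():
        for text in candidate_pool:
            batch = tokenizer(text, truncation=True, return_tensors="pt")
            scores.append(model(**batch).logits.item())
    ranked = sorted(zip(candidate_pool, scores), key=lambda p: p[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```

Once fitted, the cheap regressor stands in for the expensive oracle probe: scoring the full candidate pool costs one forward pass of a small BERT per example, after which the top-k examples feed the next pretraining stage.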
The paper also discusses limitations and future directions, including the need to better understand the combinatorial and accumulative nature of the pretraining process and the potential for scaling MATES to production-level models with billions of parameters.