2024 | Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
This paper presents data-efficient methods for pre-training large language models (LLMs), focusing on optimizing the trade-off between model quality and training resource usage. The authors propose two key techniques: ASK-LLM, which uses instruction-tuned LLMs to assess the quality of individual training examples, and DENSITY, which uses diversified sampling to maximize coverage of the latent topics in the data. The study evaluates 19 sampling strategies on T5-Small and T5-Large models, showing that ASK-LLM and DENSITY outperform the other methods in their respective categories (quality-based and coverage-based sampling).

ASK-LLM consistently produces better models even after removing up to 90% of the training data, while DENSITY recovers the performance of training on the full dataset. The results indicate that data curation can significantly improve the Pareto frontier of the model-quality vs. training-cost trade-off, enabling higher-quality models to be trained with less data. The study also highlights the importance of considering both coverage and quality in data selection, and shows that LLM-based quality raters can be effective for pre-training. Overall, the findings suggest that data-efficient training methods can reduce training costs while improving model performance.
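The two sampling ideas can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt text, the `yes_probability_fn` callable (standing in for a call to an instruction-tuned LLM that returns P("yes")), and the Gaussian-kernel density estimate are all simplifying assumptions made for this sketch.

```python
import math
import random

# Illustrative placeholder prompt; the paper's exact wording may differ.
ASK_LLM_PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative content that could "
    "help train a language model? Answer yes or no."
)

def ask_llm_score(example, yes_probability_fn):
    """ASK-LLM-style quality score: the LLM's probability of answering 'yes'.

    `yes_probability_fn` is a stand-in for querying an instruction-tuned
    LLM with the prompt and reading off P("yes"); any such callable works.
    """
    return yes_probability_fn(ASK_LLM_PROMPT.format(example=example))

def density_scores(embeddings, bandwidth=1.0):
    """Kernel-density score of each example embedding (Gaussian kernel).

    Examples in crowded regions of embedding space get high scores;
    examples in sparse regions get low scores.
    """
    scores = []
    for x in embeddings:
        s = 0.0
        for y in embeddings:
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))
            s += math.exp(-d2 / (2 * bandwidth ** 2))
        scores.append(s)
    return scores

def density_sample(examples, embeddings, k, seed=0):
    """DENSITY-style sampling: pick k examples with probability inversely
    proportional to their density score, favoring coverage of sparse topics."""
    rng = random.Random(seed)
    inv = [1.0 / s for s in density_scores(embeddings)]
    pool = list(range(len(examples)))
    chosen = []
    for _ in range(k):
        total = sum(inv[i] for i in pool)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for idx, i in enumerate(pool):
            acc += inv[i]
            if acc >= r:
                chosen.append(examples[pool.pop(idx)])
                break
    return chosen
```

On a toy set where three embeddings cluster together and one is an outlier, the outlier receives the lowest density score, so inverse-density sampling is biased toward keeping it; this is the coverage behavior the summary attributes to DENSITY, in contrast to ASK-LLM's per-example quality filtering.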