2024 | Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
This paper presents data-efficient methods for pre-training large language models (LLMs), focusing on optimizing the trade-off between model quality and training resource usage. The authors propose two key techniques: ASK-LLM, which uses instruction-tuned LLMs to assess the quality of individual training examples, and DENSITY, which uses diversified sampling to maximize coverage of the latent topics in the data. The study evaluates 19 sampling strategies on T5-Small and T5-Large models, showing that ASK-LLM and DENSITY outperform the other methods in their respective categories (quality-based and coverage-based sampling).

ASK-LLM consistently produces better models even after removing up to 90% of the training data, while DENSITY recovers the performance of training on the full dataset. The results indicate that data curation can significantly improve the Pareto frontier of the model-quality vs. training-cost trade-off, enabling higher-quality models to be trained with less data. The study also highlights the importance of considering both coverage and quality in data selection, and shows that LLM-based quality raters can be effective for pre-training. Overall, the findings suggest that data-efficient training methods can reduce training costs while improving model performance.
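The two sampling ideas can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt text, the `yes_probability_fn` callable (standing in for a call to an instruction-tuned LLM that returns P("yes")), and the Gaussian-kernel density estimate are all simplifying assumptions made for this sketch.

```python
import math
import random

# Illustrative placeholder prompt; the paper's exact wording may differ.
ASK_LLM_PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative content that could "
    "help train a language model? Answer yes or no."
)

def ask_llm_score(example, yes_probability_fn):
    """ASK-LLM-style quality score: the LLM's probability of answering 'yes'.

    `yes_probability_fn` is a stand-in for querying an instruction-tuned
    LLM with the prompt and reading off P("yes"); any such callable works.
    """
    return yes_probability_fn(ASK_LLM_PROMPT.format(example=example))

def density_scores(embeddings, bandwidth=1.0):
    """Kernel-density score of each example embedding (Gaussian kernel).

    Examples in crowded regions of embedding space get high scores;
    examples in sparse regions get low scores.
    """
    scores = []
    for x in embeddings:
        s = 0.0
        for y in embeddings:
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))
            s += math.exp(-d2 / (2 * bandwidth ** 2))
        scores.append(s)
    return scores

def density_sample(examples, embeddings, k, seed=0):
    """DENSITY-style sampling: pick k examples with probability inversely
    proportional to their density score, favoring coverage of sparse topics."""
    rng = random.Random(seed)
    inv = [1.0 / s for s in density_scores(embeddings)]
    pool = list(range(len(examples)))
    chosen = []
    for _ in range(k):
        total = sum(inv[i] for i in pool)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for idx, i in enumerate(pool):
            acc += inv[i]
            if acc >= r:
                chosen.append(examples[pool.pop(idx)])
                break
    return chosen
```

On a toy set where three embeddings cluster together and one is an outlier, the outlier receives the lowest density score, so inverse-density sampling is biased toward keeping it; this is the coverage behavior the summary attributes to DENSITY, in contrast to ASK-LLM's per-example quality filtering.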