LESS: Selecting Influential Data for Targeted Instruction Tuning


2024 | Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen
LESS is an optimizer-aware and efficient algorithm for selecting influential data for targeted instruction tuning of large language models (LLMs). The goal is to identify, within large instruction-tuning datasets, the examples most relevant to developing a specific capability. LESS is implemented with LoRA and random projections to construct a gradient datastore of low-dimensional, easily manipulable gradient features, and it selects training examples whose gradient features are most similar to those of a small set of few-shot examples embodying the target capability.

Experiments on three diverse downstream datasets (MMLU, TyDiQA, and BBH) show that training on a LESS-selected 5% of the data often outperforms training on the full dataset, and that LESS outperforms random selection by 2 to 5 points across all models and evaluation datasets.
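To make the mechanism concrete, the sketch below illustrates the two core ingredients in simplified form: per-example gradients are compressed with a fixed random projection to build the gradient datastore, and candidates are ranked by cosine similarity to the averaged gradient feature of the few-shot examples. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the per-example (LoRA) gradients have already been extracted, and it omits LESS's optimizer-aware influence formulation and its aggregation over training checkpoints. The function and variable names (random_projection, select_top_fraction, train_grads, val_grads) are hypothetical.

```python
import torch
import torch.nn.functional as F

def random_projection(grads: torch.Tensor, proj_dim: int, seed: int = 0) -> torch.Tensor:
    """Compress per-example gradients of shape (n, d) to (n, proj_dim) with a
    fixed Gaussian random projection; the same matrix must be reused for the
    training datastore and the few-shot query gradients."""
    gen = torch.Generator().manual_seed(seed)
    d = grads.shape[1]
    proj = torch.randn(d, proj_dim, generator=gen) / proj_dim ** 0.5
    return grads @ proj

def select_top_fraction(train_feats: torch.Tensor,
                        val_feats: torch.Tensor,
                        fraction: float = 0.05) -> torch.Tensor:
    """Rank candidates by cosine similarity between their projected gradient
    features and the mean few-shot gradient feature; keep the top fraction."""
    train_feats = F.normalize(train_feats, dim=1)
    query = F.normalize(val_feats.mean(dim=0), dim=0)
    scores = train_feats @ query                      # one score per candidate
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices              # indices of selected data

# Toy usage: 1,000 candidates and 8 few-shot examples with 8,192-dim gradients.
train_grads = torch.randn(1000, 8192)   # stand-in for extracted LoRA gradients
val_grads = torch.randn(8, 8192)        # stand-in for few-shot task gradients
train_feats = random_projection(train_grads, proj_dim=512)  # gradient datastore
val_feats = random_projection(val_grads, proj_dim=512)
selected = select_top_fraction(train_feats, val_feats, fraction=0.05)
print(selected.shape)                   # torch.Size([50])
```

In a setup like this, the projected training-gradient features act as a reusable datastore: targeting a new task only requires computing gradient features for its few-shot examples and re-running the similarity search.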
The selected data is also highly transferable: subsets chosen with a small model boost performance for larger models and for models from different families, including Pythia models across scales. Compared with baselines such as BM25, DSIR, and RDS, LESS is the only consistently effective approach, which justifies its relatively high computational cost. Qualitative analysis further shows that LESS selects data exemplifying the reasoning skills required by the intended downstream application, making the selection efficient, interpretable, and transferable.