The paper introduces LESS, an algorithm for selecting influential data for targeted instruction tuning of large language models (LLMs). Targeted instruction tuning aims to develop specific capabilities, such as reasoning, from a large pool of available instruction data. LESS addresses the challenge of identifying relevant data by estimating data influences and performing a Low-rank gradiEnt Similarity Search (LESS). The key contributions of LESS include:
1. **Optimizer-Aware and Efficient**: LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data.
2. **Gradient Datastore**: It constructs a reusable, transferable datastore of low-dimensional gradient features for the candidate training data.
3. **Efficient Data Selection**: LESS selects training examples whose gradients are most similar to those of few-shot examples embodying the target capability (see the sketch after this list).
4. **Transferability**: Data selected with a small model remains useful for training larger models and models from other families.
5. **Interpretable**: LESS selects data that exemplifies the reasoning skills needed for the intended downstream application, rather than relying on surface-form cues.
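
To make the selection pipeline concrete, here is a minimal sketch of the core computation. It is not the authors' implementation: names such as `adam_direction` and `influence` are hypothetical, the random tensors stand in for per-example LoRA gradients and Adam moments from a warmup run, and the real method aggregates scores across several warmup checkpoints weighted by learning rate, which is omitted here.

```python
import torch
import torch.nn.functional as F

def adam_direction(grad, exp_avg, exp_avg_sq, beta1=0.9, beta2=0.999, eps=1e-8):
    # One-step Adam update direction, used in place of the raw gradient so
    # that similarity reflects the optimizer actually used during tuning.
    m = beta1 * exp_avg + (1 - beta1) * grad
    v = beta2 * exp_avg_sq + (1 - beta2) * grad.pow(2)
    return m / (v.sqrt() + eps)

def influence(train_feats, val_feats):
    # Average cosine similarity between each projected training update
    # direction and the projected gradients of the few-shot target examples.
    return (F.normalize(train_feats, dim=1) @ F.normalize(val_feats, dim=1).T).mean(dim=1)

torch.manual_seed(0)
d, k, n_train, n_val = 10_000, 128, 1_000, 4  # full vs. projected gradient dims
proj = torch.randn(d, k) / k ** 0.5           # shared random projection (Johnson-Lindenstrauss)

# Stand-ins for per-example LoRA gradients and Adam moments from a warmup run.
grads = torch.randn(n_train, d)
exp_avg, exp_avg_sq = torch.randn(n_train, d), torch.rand(n_train, d)

train_feats = adam_direction(grads, exp_avg, exp_avg_sq) @ proj  # the gradient "datastore"
val_feats = torch.randn(n_val, d) @ proj      # few-shot gradients for the target task

scores = influence(train_feats, val_feats)
selected = scores.topk(int(0.05 * n_train)).indices  # keep the top 5% of examples
```

The random projection is what makes the datastore cheap and reusable: the k-dimensional features are computed once per candidate pool and can then be queried against the few-shot gradients of any new target task.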
Experiments on diverse downstream datasets show that training on just 5% of the data selected by LESS often outperforms training on the full dataset. LESS also performs well across different model scales and families, and its selected data remains effective when used to train larger models or models from other families. The method is evaluated on MMLU, TyDiQA, and BBH and compared against a range of baselines; LESS consistently outperforms random selection and other data selection methods, highlighting its effectiveness at selecting relevant data for targeted instruction tuning.