23 Jan 2024 | Logan Engstrom, Axel Feldmann, Aleksander Madry
The paper "DSDM: Model-Aware Dataset Selection with Datamodels" addresses the challenge of selecting training data for large-scale models, particularly language models (LMs). Traditional methods often filter data based on human notions of quality, but this approach can sometimes degrade model performance. The authors propose a novel method, DsDM (Dataset Selection with Datamodels), which frames dataset selection as an optimization problem. DsDM aims to select the subset of data that maximizes model performance on target tasks, avoiding the need for subjective notions of data quality. By using datamodels, a framework that approximates how the learning algorithm uses training data to predict, DsDM can efficiently estimate the optimal dataset selection. The method is shown to significantly improve LM performance on both pre-specified and previously unseen tasks, achieving a 2× compute multiplier over baseline methods. The paper also discusses the broader implications of dataset selection, suggesting that it can be used to fine-tune various downstream properties of trained models, such as fairness and domain-specific capabilities.The paper "DSDM: Model-Aware Dataset Selection with Datamodels" addresses the challenge of selecting training data for large-scale models, particularly language models (LMs). Traditional methods often filter data based on human notions of quality, but this approach can sometimes degrade model performance. The authors propose a novel method, DsDM (Dataset Selection with Datamodels), which frames dataset selection as an optimization problem. DsDM aims to select the subset of data that maximizes model performance on target tasks, avoiding the need for subjective notions of data quality. By using datamodels, a framework that approximates how the learning algorithm uses training data to predict, DsDM can efficiently estimate the optimal dataset selection. The method is shown to significantly improve LM performance on both pre-specified and previously unseen tasks, achieving a 2× compute multiplier over baseline methods. The paper also discusses the broader implications of dataset selection, suggesting that it can be used to fine-tune various downstream properties of trained models, such as fairness and domain-specific capabilities.