23 Jan 2024 | Logan Engstrom, Axel Feldmann, Aleksander Mądry
DsDM: Model-Aware Dataset Selection with Datamodels
Logan Engstrom, Axel Feldmann, Aleksander Mądry
Abstract: When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. In practice, however, selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we frame dataset selection as an optimization problem that we can solve directly: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework avoids handpicked notions of data quality and explicitly models how the learning process uses training datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, when we choose target tasks representative of standard LM problems and evaluate on diverse held-out benchmarks, our selected datasets provide a 2× compute multiplier over baseline methods.
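As a sketch in our own notation (the symbols below are ours, not fixed by the post: $\mathcal{A}$ is the learning algorithm, $D$ the candidate data, $k$ the selection budget, and $L_{\text{targ}}$ the loss on the target tasks), the optimization problem reads:

$$S^\star = \arg\min_{S \subseteq D,\; |S| = k} \; L_{\text{targ}}\big(\mathcal{A}(S)\big).$$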
Introduction: Training large-scale machine learning models requires selecting appropriate data. While training on vast amounts of data yields models that generalize well, not all data is equally useful. In practice, we tend to filter training data according to intuitive notions of quality, such as choosing documents similar to a "high quality" data source like Wikipedia. We find, however, that selecting data by similarity with such "high quality" sources may not improve (and can even hurt) model performance. To develop better methods for selecting training data, we start from first principles, framing dataset selection as an optimization problem whose goal is to select the data that maximizes model performance. We approximate the optimal subset by modeling how the learning algorithm uses training data to predict on the target tasks. Our resulting method, dataset selection with datamodels (DsDM), consistently improves language model performance on diverse target tasks.
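A minimal sketch of how such an approximation can drive selection, with all names below being our illustration: suppose each candidate datapoint i has an estimated datamodel weight tau[i], approximating how much including i changes target-task loss, so that the target loss of training on a subset is roughly the sum of its members' weights. Minimizing that sum over subsets of size k then reduces to keeping the k candidates with the smallest (most loss-reducing) weights:

```python
import numpy as np

def select_subset(tau: np.ndarray, k: int) -> np.ndarray:
    """Keep the k candidates predicted to reduce target loss the most.

    tau[i] is an estimated datamodel weight: the approximate change in
    target-task loss from including candidate i in the training set.
    """
    # Under the linear approximation, the best size-k subset is the one
    # whose weights sum lowest: the k smallest entries of tau.
    return np.argsort(tau)[:k]

# Hypothetical usage: tau would come from a datamodel estimation step
# (e.g., regressing target loss against random training-subset masks);
# random values stand in for real estimates here.
tau = np.random.randn(1_000_000)
chosen = select_subset(tau, k=100_000)
```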
Evaluation of DsDM: In practice, DsDM consistently reduces LM target task loss, while baseline methods often fail to outperform random selection. DsDM outperforms baselines on a diverse set of held-out benchmarks, delivering models that perform as well as those trained with 2× the compute budget on randomly selected data. DsDM also improves performance on tasks related to the target tasks without reducing performance on unrelated categories.
Selecting data for broad model capabilities: DsDM improves performance on pre-specified target tasks. When training large-scale models, however, our hope is that they will also perform well on tasks we have not yet seen. Our framework suggests a principled approach to selecting data in this scenario: choose target tasks similar to those we expect at deployment time, then select the optimal dataset subset for these target tasks, as sketched below. In this setting, DsDM outperforms all baselines across compute budgets and matches training with 2× the compute on randomly selected data.
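One way this recipe could look in code (everything here is our illustration, not the post's implementation): given per-task datamodel weight vectors for a few proxy target tasks, average them into a single weight vector and reuse the same top-k selection as above, so that the chosen data helps broadly rather than on any one task.

```python
import numpy as np

# Hypothetical per-task datamodel weights for three proxy targets,
# chosen to resemble problems expected at deployment time.
n_candidates = 1_000_000
tau_per_task = {
    "task_a": np.random.randn(n_candidates),
    "task_b": np.random.randn(n_candidates),
    "task_c": np.random.randn(n_candidates),
}

# Average the weights across target tasks, then keep the k candidates
# with the smallest combined weight (predicted to help the most).
tau_combined = np.mean(list(tau_per_task.values()), axis=0)
chosen = np.argsort(tau_combined)[:100_000]
```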
Discussion: DsDM is a 2× compute multiplier: models trained on DsDM-selected data match those trained with twice the compute on randomly selected data.