Scaling laws for learning with real and surrogate data

December 5, 2024 | Ayush Jain*, Andrea Montanari†, Eren Sasoglu*
This paper investigates the use of surrogate data in machine learning, focusing on how to integrate it into training so as to minimize test error. Surrogate data, which may come from other sources or be generated synthetically, can be combined with real data to improve model performance. The study proposes a weighted empirical risk minimization (ERM) approach, in which a weight α controls the relative contribution of real and surrogate data during training.

The key findings are: (i) integrating surrogate data can significantly reduce test error, even when the surrogate data is unrelated to the original data, an effect related to Stein's paradox; (ii) to realize this benefit, the surrogate data must be optimally weighted in the ERM objective; (iii) the test error of models trained on mixtures of real and surrogate data follows a scaling law, which can be used to predict the optimal weighting scheme and the amount of surrogate data worth adding.

The theoretical analysis covers several settings: Gaussian sequence models, non-parametric regression, low-dimensional empirical risk minimization, and high-dimensional ridge regression. The theory is validated empirically on datasets from several domains, including natural language processing, image classification, and survival analysis. The scaling law is shown to accurately predict test-error behavior, and the optimal weight α can be determined from the fitted law or selected on validation data. The study also highlights the importance of choosing the weight well, and the benefits of surrogate data in scenarios where real data is scarce or expensive.
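To make the weighted ERM idea concrete, here is a minimal sketch for the ridge-regression setting studied in the paper; the function name and the exact normalization of the objective are illustrative choices, not taken from the paper:

```python
import numpy as np

def weighted_erm_ridge(X_real, y_real, X_surr, y_surr, alpha, lam=1e-3):
    """Closed-form minimizer of the weighted ridge objective

        alpha * mse(real data) + (1 - alpha) * mse(surrogate data)
        + lam * ||theta||^2

    alpha = 1 ignores the surrogate data; alpha = 0 ignores the real data.
    """
    n, d = X_real.shape
    m = X_surr.shape[0]
    # Normal equations of the weighted least-squares problem.
    A = (alpha / n) * X_real.T @ X_real \
        + ((1.0 - alpha) / m) * X_surr.T @ X_surr \
        + lam * np.eye(d)
    b = (alpha / n) * X_real.T @ y_real \
        + ((1.0 - alpha) / m) * X_surr.T @ y_surr
    return np.linalg.solve(A, b)
```

In practice one would sweep α over a grid, train one model per value, and compare them on held-out real data.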
The paper concludes that the scaling law is a useful tool for integrating heterogeneous data into training and improving model performance.
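The validation-based choice of the weight α mentioned above amounts to a one-dimensional grid search; the helper below is a generic sketch (its signature and names are illustrative, not from the paper):

```python
import numpy as np

def select_alpha(train_fn, val_loss_fn, alphas):
    """Return the mixture weight with the lowest validation loss.

    train_fn(alpha)    -> a model trained with that real/surrogate weight
    val_loss_fn(model) -> scalar loss on held-out real data
    """
    losses = np.array([val_loss_fn(train_fn(a)) for a in alphas])
    best = int(np.argmin(losses))
    return alphas[best], losses
```

The fitted scaling law can serve the same purpose with fewer training runs, by predicting the test error as a function of α and the data sizes.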