Scaling laws for learning with real and surrogate data

December 5, 2024 | Ayush Jain*, Andrea Montanari†, Eren Sasoglu*
This paper investigates the use of surrogate data in machine learning, focusing on how to integrate it into training so as to minimize test error. Surrogate data, which may come from other sources or be generated synthetically, can be combined with real data to improve model performance. The study proposes a weighted empirical risk minimization (ERM) approach, in which a weight α controls the relative contribution of real and surrogate data during training.

The key findings are: (i) integrating surrogate data can significantly reduce test error, even when the surrogate data is unrelated to the original data, an effect related to Stein's paradox; (ii) to realize this benefit, the surrogate data must be optimally weighted in the ERM objective; (iii) the test error of models trained on mixtures of real and surrogate data follows a scaling law, which can be used to predict the optimal weighting scheme and the amount of surrogate data worth adding.

The theoretical analysis covers several settings: Gaussian sequence models, non-parametric regression, low-dimensional empirical risk minimization, and high-dimensional ridge regression. The theory is validated empirically on datasets from several domains, including natural language processing, image classification, and survival analysis. The scaling law is shown to accurately predict test-error behavior, and the optimal weight α can be determined from the fitted law or selected on validation data. The study also highlights the importance of choosing the weight well, and the benefits of surrogate data in scenarios where real data is scarce or expensive.
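To make the weighted ERM idea concrete, here is a minimal sketch for the ridge-regression setting studied in the paper; the function name and the exact normalization of the objective are illustrative choices, not taken from the paper:

```python
import numpy as np

def weighted_erm_ridge(X_real, y_real, X_surr, y_surr, alpha, lam=1e-3):
    """Closed-form minimizer of the weighted ridge objective

        alpha * mse(real data) + (1 - alpha) * mse(surrogate data)
        + lam * ||theta||^2

    alpha = 1 ignores the surrogate data; alpha = 0 ignores the real data.
    """
    n, d = X_real.shape
    m = X_surr.shape[0]
    # Normal equations of the weighted least-squares problem.
    A = (alpha / n) * X_real.T @ X_real \
        + ((1.0 - alpha) / m) * X_surr.T @ X_surr \
        + lam * np.eye(d)
    b = (alpha / n) * X_real.T @ y_real \
        + ((1.0 - alpha) / m) * X_surr.T @ y_surr
    return np.linalg.solve(A, b)
```

In practice one would sweep α over a grid, train one model per value, and compare them on held-out real data.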
The paper concludes that the scaling law is a useful tool for integrating heterogeneous data into training and improving model performance.
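The validation-based choice of the weight α mentioned above amounts to a one-dimensional grid search; the helper below is a generic sketch (its signature and names are illustrative, not from the paper):

```python
import numpy as np

def select_alpha(train_fn, val_loss_fn, alphas):
    """Return the mixture weight with the lowest validation loss.

    train_fn(alpha)    -> a model trained with that real/surrogate weight
    val_loss_fn(model) -> scalar loss on held-out real data
    """
    losses = np.array([val_loss_fn(train_fn(a)) for a in alphas])
    best = int(np.argmin(losses))
    return alphas[best], losses
```

The fitted scaling law can serve the same purpose with fewer training runs, by predicting the test error as a function of α and the data sizes.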