A Statistical Theory of Regularization-Based Continual Learning

2024 | Xuyang Zhao, Huiyuan Wang, Weiran Huang, Wei Lin
This paper provides a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks, focusing on how different regularization terms affect model performance. The authors derive the convergence rate of the oracle estimator, which assumes all tasks' data are available simultaneously. They then introduce a family of generalized \(\ell_2\)-regularization algorithms indexed by matrix-valued hyperparameters, which includes the minimum norm estimator and continual ridge regression as special cases. An iterative update formula for the estimation error of these algorithms is derived, allowing the optimal hyperparameters to be determined. These optimal hyperparameters balance the trade-off between forward and backward knowledge transfer and adjust for data heterogeneity. The estimation error of the resulting optimal algorithm is derived explicitly and shown to be of the same order as that of the oracle estimator. Lower bounds for the minimum norm estimator and continual ridge regression demonstrate their suboptimality. A key finding is the equivalence between early stopping and generalized \(\ell_2\)-regularization in continual learning. The theoretical results are complemented by simulation experiments, which show that the proposed method outperforms existing algorithms in terms of estimation error and avoids catastrophic forgetting.
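
To make the generalized \(\ell_2\)-regularization idea concrete, below is a minimal Python sketch (our own illustration, not code from the paper). It assumes the schematic update in which each new task's least-squares loss is penalized by a matrix-weighted distance to the previous estimate, \((\theta - \hat{\theta}_{t-1})^\top \Lambda_t (\theta - \hat{\theta}_{t-1})\); choosing \(\Lambda_t = \lambda I\) corresponds to continual ridge regression, while letting \(\Lambda_t \to 0\) approaches a minimum-norm-change update. The function name, the data-generating setup, and the fixed choice \(\Lambda_t = I\) are all hypothetical and chosen only for illustration.

import numpy as np

def generalized_l2_update(X, y, theta_prev, Lam):
    # One continual-learning step (illustrative): fit the new task's data while
    # penalizing deviation from the previous estimate theta_prev with a
    # positive-definite matrix hyperparameter Lam, i.e. solve
    #   argmin_theta ||y - X theta||^2 + (theta - theta_prev)' Lam (theta - theta_prev).
    # Lam = lam * I is continual ridge regression; Lam -> 0 approaches a
    # minimum-norm-change update.
    A = X.T @ X + Lam
    b = X.T @ y + Lam @ theta_prev
    return np.linalg.solve(A, b)

# Toy sequence of linear regression tasks sharing one true parameter
# (hypothetical setup, purely for illustration).
rng = np.random.default_rng(0)
p, n_tasks, n_per_task = 20, 5, 15  # overparameterized within each task
theta_star = rng.normal(size=p)
theta_hat = np.zeros(p)
for t in range(n_tasks):
    X = rng.normal(size=(n_per_task, p))
    y = X @ theta_star + 0.1 * rng.normal(size=n_per_task)
    theta_hat = generalized_l2_update(X, y, theta_hat, Lam=np.eye(p))
    print(f"task {t}: estimation error {np.linalg.norm(theta_hat - theta_star):.3f}")

Note that the fixed identity penalty used here is just the continual-ridge special case; the paper's contribution is to characterize how the matrix hyperparameters should be chosen so that the resulting estimation error matches the oracle rate.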