The paper by Andrew Y. Ng explores the effectiveness of $L_1$ versus $L_2$ regularization in supervised learning, particularly in the presence of many irrelevant features. For logistic regression, the author shows that $L_1$ regularization achieves a sample complexity that grows only logarithmically in the number of irrelevant features, so it can learn effectively even when the number of irrelevant features is exponential in the number of training examples. This contrasts with $L_2$ regularization, which, being rotationally invariant, has a worst-case sample complexity that grows at least linearly in the number of irrelevant features. The paper supports these findings with theoretical bounds and empirical results, showing that $L_1$ regularization is more robust to irrelevant features and can learn effectively from fewer training examples.
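
To make the contrast concrete, below is a minimal sketch (not the paper's experimental setup) comparing $L_1$- and $L_2$-regularized logistic regression, i.e., minimizing the logistic loss plus $\lambda\|\theta\|_1$ or $\lambda\|\theta\|_2^2$, on synthetic data where only a few of many features are relevant. It assumes numpy and scikit-learn; the data sizes, noise level, and regularization strength are illustrative choices, not values from the paper.

```python
# Minimal sketch: L1- vs L2-regularized logistic regression with many irrelevant features.
# Assumes numpy and scikit-learn; all sizes and constants below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_samples, n_features, n_relevant = 100, 1000, 3   # few examples, mostly irrelevant features
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:n_relevant] = 2.0                           # only the first few features carry signal
y = (X @ true_w + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for penalty in ("l1", "l2"):
    clf = LogisticRegression(penalty=penalty, C=0.5, solver="liblinear")
    clf.fit(X_train, y_train)
    nonzero = int(np.sum(np.abs(clf.coef_) > 1e-6))
    print(f"{penalty}: test accuracy = {clf.score(X_test, y_test):.2f}, "
          f"nonzero coefficients = {nonzero}")
```

In runs like this, the $L_1$-penalized model typically zeroes out most of the irrelevant coefficients and generalizes better from the small training set, while the $L_2$-penalized model spreads weight across all features, which is consistent with the paper's thesis.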