2024 | Brian R. Bartoldson, James Diffenderfer, Konstantinos Parasyris, Bhavya Kailkhura
This paper revisits the challenge of making image classifiers robust to imperceptible perturbations, focusing on the CIFAR-10 dataset. Although state-of-the-art (SOTA) models reach roughly 100% clean accuracy, their robustness to $\ell_\infty$-norm bounded perturbations barely exceeds 70%. The authors develop the first scaling laws for adversarial training, which expose inefficiencies in prior methods and yield actionable insights. In particular, many SOTA methods sit far from the compute-optimal frontier, spending more resources than their level of robustness requires. Using a more compute-efficient setup, the authors surpass the prior SOTA with 74% AutoAttack accuracy (a +3% gain) while using 20% fewer training (and inference) FLOPs. However, the scaling laws predict that robustness plateaus at around 90% even with much more compute, making perfect robustness impractical. A small-scale human evaluation on AutoAttack data shows that human performance also plateaus near 90%, which the authors attribute to $\ell_\infty$-constrained attacks generating invalid images that no longer match their original labels. The paper closes with promising directions for future research, emphasizing the need for more efficient training algorithms and improved architectures, as well as the need to rethink attack formulations to account for image validity.
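For intuition on why the predicted ceiling sits below 100%, scaling-law fits of this kind typically use a saturating power law; the form below is a generic illustration, not the paper's exact parameterization:

$\text{RobustAcc}(C) \approx A_\infty - k\,C^{-\alpha}$, with $k, \alpha > 0$,

where $C$ is compute (e.g., training FLOPs) and $A_\infty$ is the asymptotic ceiling. As $C \to \infty$, predicted robustness approaches $A_\infty$ (here roughly 90%) rather than 100%, so additional compute alone cannot close the remaining gap.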