25 Mar 2019 | Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, Kaiming He
This paper explores the effectiveness of feature denoising for improving the adversarial robustness of convolutional networks. The authors observe that adversarial perturbations on images lead to noise in the features computed by these networks. Motivated by this observation, they develop new network architectures containing blocks designed to denoise features, using operations such as non-local means or other filters. The networks are trained end-to-end, and when combined with adversarial training they substantially advance the state of the art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks, the method achieves 55.7% accuracy, compared with 27.9% for the prior art; even under extreme 2,000-iteration PGD white-box attacks it still reaches 42.6% accuracy. The method ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018, achieving 50.6% classification accuracy on a secret, ImageNet-like test set against 48 unknown attackers. The paper also evaluates the design of various denoising operations and their impact on adversarial robustness.
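As a rough illustration of the kind of denoising block described above, here is a minimal PyTorch sketch: a non-local-means smoothing step wrapped by a 1x1 convolution and an identity residual connection. This follows the block structure the paper describes, but the class name, the choice of softmax (Gaussian) weighting over the dot-product variant, and all hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingBlock(nn.Module):
    """Sketch of a feature-denoising block: non-local-means smoothing,
    then a 1x1 conv, added back to the input via a residual connection.
    Illustrative only; not the paper's reference implementation."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        flat = x.view(n, c, h * w)                        # (N, C, HW)
        # Pairwise affinity between all spatial positions (dot product).
        affinity = torch.bmm(flat.transpose(1, 2), flat)  # (N, HW, HW)
        # Softmax weighting, i.e. the Gaussian variant of non-local means;
        # the paper also studies a dot-product variant normalized by 1/HW.
        weights = F.softmax(affinity, dim=-1)
        # Each output position is a weighted mean over all positions.
        denoised = torch.bmm(flat, weights.transpose(1, 2)).view(n, c, h, w)
        return x + self.conv(denoised)                    # residual connection
```

Note that the HW x HW affinity matrix makes this memory-hungry on large feature maps, which is one reason such blocks are typically placed on the smaller, deeper feature maps of the network.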
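For reference, the PGD white-box attack cited in the results is the standard projected-gradient-descent attack under an L-infinity budget. Below is a minimal sketch of the 10-iteration setting; the epsilon, step size, and random start shown here are common illustrative choices, not necessarily the paper's exact evaluation configuration.

```python
def pgd_attack(model, images, labels, eps=16 / 255, alpha=2 / 255, steps=10):
    """Minimal L_inf PGD sketch: ascend the cross-entropy loss, projecting
    back into the eps-ball around the clean images after every step.
    Hyperparameters are illustrative assumptions."""
    # Random start inside the eps-ball, clipped to the valid image range.
    adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                  # gradient-ascent step
            adv = images + (adv - images).clamp(-eps, eps)   # project into eps-ball
            adv = adv.clamp(0, 1).detach()                   # stay in valid range
    return adv
```

Adversarial training, as used in the paper, then amounts to generating such perturbed batches on the fly and training the denoising network on them end-to-end.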