24 Jul 2021 | Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer
The paper introduces four new real-world distribution shift datasets: ImageNet-Renditions, StreetView StoreFronts, DeepFashion Remixed, and Real Blurry Images. These datasets capture changes in image style, blurriness, geographic location, and camera operation. The authors evaluate four methods for improving out-of-distribution robustness: larger models, self-attention, diverse data augmentation, and pretraining. Contrary to previous claims, they find that larger models and diverse data augmentations can improve robustness on real-world distribution shifts. DeepAugment, a new data augmentation method they introduce, further enhances robustness, outperforming models pretrained with 1000× more labeled data. The study concludes that while some methods consistently help with certain types of distribution shifts, no single method consistently improves robustness across all shifts, highlighting the need for more comprehensive evaluation using multiple robustness datasets.
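The core idea behind DeepAugment is to pass training images through image-to-image networks whose weights and activations are randomly perturbed, yielding diverse but structured distortions. As a rough illustration only, the toy sketch below (not the authors' implementation, which uses full autoencoder and super-resolution networks) stands in a perturbed 3×3 convolution for the perturbed network: with no noise the kernel is the identity and the image passes through unchanged, while random weight noise produces a different distortion per seed.

```python
import random

def deepaugment_sketch(image, seed, noise_scale=0.5):
    """Toy, hypothetical sketch of the DeepAugment idea: run an image
    through an image-to-image transform with randomly perturbed weights.
    Here a single 3x3 convolution stands in for the paper's networks."""
    rng = random.Random(seed)
    # Identity kernel: with zero noise, the output equals the input.
    kernel = [[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]]
    # Perturb the "network weights" with Gaussian noise.
    kernel = [[w + rng.gauss(0.0, noise_scale) for w in row] for row in kernel]

    h, w = len(image), len(image[0])

    def px(i, j):
        # Edge-replicate padding at the image borders.
        return image[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    out = []
    for i in range(h):
        row = []
        for j in range(w):
            v = sum(kernel[di][dj] * px(i + di - 1, j + dj - 1)
                    for di in range(3) for dj in range(3))
            row.append(min(max(v, 0.0), 1.0))  # clip to [0, 1]
        out.append(row)
    return out

if __name__ == "__main__":
    img = [[(i * 8 + j) / 64.0 for j in range(8)] for i in range(8)]
    aug1 = deepaugment_sketch(img, seed=1)
    aug2 = deepaugment_sketch(img, seed=2)
    print(len(aug1), len(aug1[0]), aug1 != aug2)
```

Each seed yields a distinct corrupted view of the same image, which is the property that makes this family of augmentations "diverse" in the paper's sense; the real method draws its distortions from much richer networks than a single convolution.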