9 Nov 2022 | Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel
Convolutional Neural Networks (CNNs) are commonly thought to recognize objects by learning increasingly complex representations of object shapes. However, recent studies suggest that image textures may play a more significant role. This study tests the two hypotheses quantitatively by evaluating CNNs and human observers on images in which texture and shape cues conflict. The results show that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes, in contrast to human observers. When trained on a stylized version of ImageNet (Stylized-ImageNet), the same ResNet-50 architecture learns a shape-based representation that aligns better with human performance. Training on Stylized-ImageNet also improves object detection and robustness to various image distortions. Taken together, the findings suggest that ImageNet-trained CNNs over-rely on textures rather than global shapes for object recognition, that training on stylized data can shift this bias towards shapes, and that the resulting shape-based representations perform better on recognition tasks, are more robust to image distortions, and offer a closer model of human visual processing.
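The texture-versus-shape comparison on cue-conflict images can be made concrete with a small sketch. The summary above does not state how the bias is scored, so the measure used here is an assumption: among trials where the observer picked either the shape category or the texture category, take the fraction that went to shape. The names `CueConflictTrial` and `shape_bias` and the sample labels are hypothetical, chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class CueConflictTrial(NamedTuple):
    """One cue-conflict image and the response it received (hypothetical record)."""
    shape_label: str    # category implied by the object's global shape
    texture_label: str  # category implied by the surface texture
    prediction: str     # category chosen by the model or human observer


def shape_bias(trials: Iterable[CueConflictTrial]) -> float:
    """Fraction of shape decisions among all shape-or-texture decisions.

    Assumed scoring rule: responses matching neither cue are ignored,
    as are images whose shape and texture labels agree (no conflict).
    """
    shape_hits = 0
    texture_hits = 0
    for trial in trials:
        if trial.shape_label == trial.texture_label:
            continue  # not a conflict image
        if trial.prediction == trial.shape_label:
            shape_hits += 1
        elif trial.prediction == trial.texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no response matched either the shape or the texture cue")
    return shape_hits / decided


# Made-up responses, not data from the study.
trials = [
    CueConflictTrial("cat", "elephant", "elephant"),
    CueConflictTrial("cat", "elephant", "cat"),
    CueConflictTrial("car", "clock", "clock"),
    CueConflictTrial("bear", "bottle", "bottle"),
    CueConflictTrial("dog", "bicycle", "chair"),  # matches neither cue, ignored
]
print(shape_bias(trials))  # 0.25
```

Under this scoring rule, a value near 0 would indicate a texture-driven observer and a value near 1 a shape-driven one; the sample above gives one shape decision against three texture decisions.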