Bilinear CNNs for Fine-grained Visual Recognition

1 Jun 2017 | Tsung-Yu Lin, Aruni RoyChowdhury, Subhransu Maji
The paper introduces Bilinear Convolutional Neural Networks (B-CNNs) for fine-grained visual recognition, which represent an image as the pooled outer product of features from two CNNs. B-CNNs capture localized feature interactions in a translationally invariant manner and can be trained end-to-end. The authors report per-image accuracies of 84.1% on Caltech-UCSD Birds, 79.4% on NABirds, 86.9% on FGVC Aircraft, and 91.3% on Stanford Cars, while running at 30 frames per second on an NVIDIA Titan X GPU. The paper also presents a systematic analysis of B-CNNs, showing that bilinear features can be reduced by an order of magnitude in dimension without significant loss in accuracy, and that these features are effective for other image classification tasks such as texture and scene recognition. Additionally, B-CNNs can be trained from scratch on the ImageNet dataset, offering consistent improvements over baseline architectures. Visualizations of top activations and inverse images are provided, and the complete source code is available online.
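To make the pooling step concrete: the descriptor is the sum over spatial locations l of the outer product fA(l) fB(l)ᵀ of the two streams' features, so the result records which pairs of features co-occur but not where, which is what makes it orderless and translationally invariant. Below is a minimal PyTorch sketch of this step, not the authors' released code; the function and variable names (`bilinear_pool`, `fa`, `fb`) are chosen here for illustration. It assumes both streams produce feature maps of the same spatial size, and it includes the signed square-root and L2 normalization applied to the bilinear vector in the paper.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
    """Sum-pool the outer product of two CNN feature maps.

    fa: (batch, ca, h, w) features from stream A.
    fb: (batch, cb, h, w) features from stream B (same h, w as fa).
    Returns a (batch, ca * cb) image descriptor.
    """
    b, ca, h, w = fa.shape
    cb = fb.shape[1]
    fa = fa.reshape(b, ca, h * w)           # (b, ca, hw)
    fb = fb.reshape(b, cb, h * w)           # (b, cb, hw)
    # Outer product at every location, summed over locations:
    x = torch.bmm(fa, fb.transpose(1, 2))   # (b, ca, cb)
    x = x.reshape(b, ca * cb)
    # Signed square-root, then L2 normalization, as in the paper:
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-10)
    return F.normalize(x, dim=1)
```

With two 512-channel streams (e.g., VGG conv5 features), the descriptor is 512 × 512 = 262,144-dimensional, which is why the order-of-magnitude dimensionality reduction analyzed in the paper matters in practice. Since every operation above is differentiable, gradients flow through the pooling into both CNN streams, enabling the end-to-end training the summary describes.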