Bilinear CNNs for Fine-grained Visual Recognition

1 Jun 2017 | Tsung-Yu Lin, Aruni RoyChowdhury, Subhransu Maji
Bilinear Convolutional Neural Networks (B-CNNs) are introduced for fine-grained visual recognition. These networks represent an image as the pooled outer product of features from two CNNs, capturing localized feature interactions in a translationally invariant manner. B-CNNs are orderless texture representations that, unlike prior hand-designed variants, can be trained end-to-end. The most accurate model achieves 84.1%, 79.4%, 86.9%, and 91.3% per-image accuracy on the Caltech-UCSD Birds, NABirds, FGVC Aircraft, and Stanford Cars datasets, respectively, while running at 30 frames per second on an NVIDIA Titan X GPU.

The paper presents a systematic analysis showing that bilinear features are highly redundant and can be reduced in size without significant loss in accuracy. B-CNNs are also effective for texture and scene recognition and can be trained from scratch on ImageNet. Visualizations of the models on various datasets are provided, and the source code is available at http://vis-www.cs.umass.edu/bcnn.

The paper further relates B-CNNs to classical texture representations such as BoVW, VLAD, FV, and O2P, showing that each can be written as a B-CNN. On fine-grained recognition benchmarks, B-CNNs outperform existing models while remaining efficient, and the same representation carries over to texture and scene recognition tasks. The analysis confirms that high accuracy is retained even at substantially reduced feature dimensionality. The paper concludes that B-CNNs are a promising approach for fine-grained visual recognition.
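To make the core operation concrete, below is a minimal PyTorch sketch of bilinear pooling as described above: the outer product of two feature maps is summed over all spatial locations, then passed through the signed square-root and L2 normalization used in the paper. The function name, tensor shapes, and the averaging over locations are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Sum-pooled outer product of two CNN feature maps.

    feat_a: (N, C1, H, W) and feat_b: (N, C2, H, W) are features from the
    two streams, assumed to be spatially aligned. Returns one
    (N, C1 * C2) descriptor per image, independent of feature location
    (hence orderless / translationally invariant).
    """
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)           # (N, C1, HW)
    b = feat_b.reshape(n, c2, h * w)           # (N, C2, HW)
    # Batched matrix product == sum of per-location outer products.
    x = torch.bmm(a, b.transpose(1, 2))        # (N, C1, C2)
    x = x.reshape(n, c1 * c2) / (h * w)        # flatten; scale by #locations (a common choice)
    # Signed square-root and L2 normalization, as in the paper.
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-10)
    return F.normalize(x, dim=1)
```

In the symmetric variant both streams come from the same backbone (e.g., a 512-channel VGG conv layer), so the pooled descriptor has 512 x 512 = 262,144 dimensions, which is then fed to a linear classifier; the redundancy analysis in the paper shows this descriptor can be made much smaller without significant accuracy loss.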