This paper proposes a novel deep learning framework for predicting face attributes in the wild. The framework consists of two convolutional neural networks (CNNs), LNet and ANet. LNet is pre-trained on a large number of general object categories for face localization, while ANet is pre-trained on a large number of face identities for attribute prediction. The two networks are then fine-tuned jointly using attribute tags. This approach significantly outperforms existing methods and reveals important insights into learning face representations.
LNet is trained in a weakly supervised manner, using only image-level attribute tags, which simplifies data preparation. It is pre-trained on ImageNet categories to handle background clutter and then fine-tuned with attribute tags. This enables LNet to accurately localize faces without requiring bounding boxes or landmarks. ANet is pre-trained on face identities and then fine-tuned with attribute tags to extract discriminative face representations. The pre-training step allows ANet to account for complex variations in unconstrained face images.
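The localization idea above can be sketched in a few lines: LNet's averaged response map fires strongly on face regions, so a bounding box can be read off by thresholding the map and taking the extent of the high-response area. The following NumPy toy (function name and threshold rule are hypothetical simplifications, not the paper's exact procedure) illustrates this step on a synthetic response map:

```python
import numpy as np

def localize_face(response_map, thresh_ratio=0.5):
    # Weakly supervised localization sketch (assumed simplification):
    # threshold the averaged response map of the last conv layer and
    # return the bounding box of the high-response region.
    t = thresh_ratio * response_map.max()
    ys, xs = np.nonzero(response_map >= t)
    return tuple(int(v) for v in (xs.min(), ys.min(), xs.max(), ys.max()))

# Toy response map with a strong "face" blob at rows 4..7, cols 5..8.
rm = np.zeros((12, 12))
rm[4:8, 5:9] = 1.0
print(localize_face(rm))  # (5, 4, 8, 7)  -> (x0, y0, x1, y1)
```

The key point is that no bounding-box supervision enters anywhere: the response map itself, learned from image-level attribute tags, carries the location signal.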
The framework also introduces a novel fast feed-forward algorithm for CNNs with locally shared filters, which reduces redundant computation. This algorithm enables efficient feature extraction and improves performance. The study reveals that pre-training on massive numbers of object categories and face identities enhances feature learning for face localization and attribute recognition. It also shows that filters in LNet, although fine-tuned with attribute tags, can indicate face locations through their response maps. Additionally, ANet's high-level hidden neurons implicitly learn semantic concepts related to identity, such as race, gender, and age.
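The paper does not spell out the fast feed-forward algorithm here, but the equivalence it exploits can be shown with a toy NumPy sketch (all names hypothetical): a locally shared layer assigns each output region its own kernel, so instead of looking up a kernel per window, one can run each of the few distinct kernels over the whole input once and stitch the per-region responses together, letting those global maps be reused rather than recomputed window by window:

```python
import numpy as np

def conv2d_valid(x, k):
    # Minimal 'valid' cross-correlation used by both variants below.
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def locally_shared_naive(x, filters):
    # filters: 2x2 grid of kernels; each output cell uses the kernel of
    # the grid region it falls in (locally shared weights).
    kh, kw = filters[0][0].shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            f = filters[2 * i // oh][2 * j // ow]
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

def locally_shared_fast(x, filters):
    # Run each distinct kernel globally once, then stitch the quadrants:
    # the per-kernel maps can be shared across overlapping windows.
    maps = [[conv2d_valid(x, k) for k in row] for row in filters]
    oh, ow = maps[0][0].shape
    h2, w2 = oh // 2, ow // 2
    out = np.empty((oh, ow))
    for gi in range(2):
        for gj in range(2):
            out[gi * h2:(gi + 1) * h2, gj * w2:(gj + 1) * w2] = \
                maps[gi][gj][gi * h2:(gi + 1) * h2, gj * w2:(gj + 1) * w2]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10))
filters = [[rng.standard_normal((3, 3)) for _ in range(2)] for _ in range(2)]
print(np.allclose(locally_shared_naive(x, filters),
                  locally_shared_fast(x, filters)))  # True
```

This toy only demonstrates the equivalence of the two evaluation orders; the actual speedup in the paper comes from reusing the globally computed maps across many overlapping evaluation windows.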
The framework is evaluated on two large face attribute datasets, CelebA and LFWA, achieving state-of-the-art results. It outperforms existing methods in attribute prediction and demonstrates strong generalization ability. The study also reveals that the number of attributes affects localization performance, and that a sparse linear combination of semantic concepts can effectively explain attributes. The framework is shown to be robust to various challenges, including extreme poses and occlusions, and has potential for real-world applications.
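The claim that attributes are explained by a sparse linear combination of semantic concepts can be illustrated with a standard L1-regularized regression. The sketch below (ISTA with hypothetical names; not necessarily the paper's exact procedure) builds a synthetic "attribute" response from two of six "concept" responses and recovers which concepts matter:

```python
import numpy as np

def sparse_explain(attr, concepts, lam=0.05, lr=0.2, steps=1000):
    # ISTA sketch for min_w 0.5 * ||concepts @ w - attr||^2 + lam * ||w||_1:
    # a gradient step on the squared error, then soft-thresholding for sparsity.
    w = np.zeros(concepts.shape[1])
    for _ in range(steps):
        grad = concepts.T @ (concepts @ w - attr)
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
C = rng.standard_normal((50, 6))
C /= np.linalg.norm(C, axis=0)         # unit-norm "concept" responses
attr = 1.5 * C[:, 0] - 0.8 * C[:, 2]   # attribute = sparse mix of concepts 0 and 2
w = sparse_explain(attr, C)
print(np.argsort(-np.abs(w))[:2])      # concepts 0 and 2 dominate
```

The recovered weight vector is sparse and concentrates on the two generating concepts, mirroring how, e.g., a gender-related attribute could be explained by a few identity-related neurons rather than all of them.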