Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks

This paper proposes a multi-task cascaded convolutional neural network (CNN) framework for joint face detection and alignment. The framework exploits the correlation between the two tasks and consists of three stages of CNNs that predict face and landmark locations in a coarse-to-fine manner: the first stage uses a shallow CNN to generate candidate windows, the second uses a more complex CNN to refine these candidates, and the third uses a more powerful CNN to produce the final bounding boxes and facial landmark positions. The framework also introduces an online hard sample mining strategy that selects the most difficult samples during training, improving the detector automatically without manual sample selection. Evaluated on the FDDB and WIDER FACE benchmarks for face detection and the AFLW benchmark for face alignment, the method outperforms the state-of-the-art methods it is compared against on both tasks while maintaining real-time performance, with high speed reported on both CPU and GPU. The results indicate that multi-task learning improves both face detection and alignment.
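The coarse-to-fine cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Stage` class, the `squareness` scoring function, the identity `refine` step, and the thresholds are all hypothetical stand-ins for the three CNNs, chosen only so the control flow is runnable without a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Stage:
    """One cascade stage (hypothetical stand-in for a CNN)."""
    name: str
    score: Callable[[Box], float]  # face confidence for a window
    refine: Callable[[Box], Box]   # bounding-box adjustment for kept windows
    threshold: float               # windows scoring below this are rejected


def run_cascade(candidates: Sequence[Box], stages: Sequence[Stage]) -> List[Box]:
    """Pass windows through each stage in order, keeping and refining survivors."""
    boxes = list(candidates)
    for stage in stages:
        boxes = [stage.refine(b) for b in boxes if stage.score(b) >= stage.threshold]
    return boxes


def squareness(box: Box) -> float:
    """Toy score (assumption): ratio of the short side to the long side."""
    width, height = box[2] - box[0], box[3] - box[1]
    return min(width, height) / max(width, height)


def identity(box: Box) -> Box:
    """Toy refinement (assumption): leave the window unchanged."""
    return box


# Three stages with progressively stricter thresholds (illustrative values).
demo_stages = [
    Stage("candidate generation", squareness, identity, 0.5),
    Stage("candidate refinement", squareness, identity, 0.7),
    Stage("final output", squareness, identity, 0.9),
]

demo_candidates = [
    (0.0, 0.0, 10.0, 10.0),  # passes every stage
    (0.0, 0.0, 10.0, 8.0),   # rejected by the third stage
    (0.0, 0.0, 10.0, 4.0),   # rejected by the first stage
]

survivors = run_cascade(demo_candidates, demo_stages)
```

The point of the structure is that the cheap early stage discards most windows, so the more expensive later stages only run on the few candidates that remain.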
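The online hard sample mining idea mentioned above can be sketched as follows: within each mini-batch, rank samples by their loss and train only on the hardest ones. This is a simplified illustration, not the authors' code; the function names and the `keep_ratio` default of 0.7 are assumptions and are not taken from the summary above.

```python
from typing import List, Sequence


def select_hard_samples(losses: Sequence[float], keep_ratio: float = 0.7) -> List[int]:
    """Return indices of the highest-loss samples in a mini-batch.

    keep_ratio is the fraction of the batch treated as "hard" (assumed value).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not losses:
        return []
    n_keep = max(1, round(len(losses) * keep_ratio))
    # Sort sample indices by loss, largest first, and keep the top n_keep.
    by_loss = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(by_loss[:n_keep])


def mined_batch_loss(losses: Sequence[float], keep_ratio: float = 0.7) -> float:
    """Average loss over the hard samples only; easy samples contribute nothing."""
    hard = select_hard_samples(losses, keep_ratio)
    if not hard:
        raise ValueError("cannot compute a loss for an empty batch")
    return sum(losses[i] for i in hard) / len(hard)


# Per-sample losses from a hypothetical forward pass over a batch of four.
batch_losses = [0.1, 0.9, 0.5, 0.3]
hard_indices = select_hard_samples(batch_losses, keep_ratio=0.5)
hard_loss = mined_batch_loss(batch_losses, keep_ratio=0.5)
```

Because the selection is recomputed from the current losses on every batch, the set of hard samples adapts as training progresses, which is what removes the need for manual sample selection.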

Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

2015 | Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Senior Member, IEEE, and Yu Qiao, Senior Member, IEEE