June 2024 | Fan Zhang¹, Gongguan Chen¹, Hua Wang² (✉), and Caiming Zhang³
This study proposes a cross-fusion dual-attention network (CF-DAN) for facial-expression recognition (FER) in real-world environments. The network consists of three components: (1) a cross-fusion grouped dual-attention mechanism that refines local features and captures global information; (2) a C² activation function construction method based on a piecewise cubic polynomial with three degrees of freedom, which requires less computation while improving flexibility and recognition ability; and (3) a closed-loop operation between the self-attention distillation process and residual connections that suppresses redundant information and improves the generalization ability of the model. The recognition accuracies on the RAF-DB, FERPlus, and AffectNet datasets were 92.78%, 92.02%, and 63.58%, respectively. Experiments show that this model can provide more effective solutions for FER tasks.
The CF-DAN network uses a dual-attention mechanism to extract both spatial and channel features, enabling the model to better capture global and local information. The C² activation function improves the ability of the interactive learning mechanism to integrate different features. Additionally, the self-attention distillation mechanism divides attention into groups and applies self-attention distillation within each group to shrink the spatial dimensions of K and V, cutting the computational cost of self-attention by 33%.
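The distillation idea described above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the paper's implementation: the function name `distilled_attention`, the pooling factor `r`, and the use of average pooling to shrink K and V are all assumptions. The point is only that the attention map becomes N × (N/r) instead of N × N, which is where the cost saving comes from.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distilled_attention(x, wq, wk, wv, r=4):
    """Self-attention with spatially distilled K and V (illustrative sketch).

    x: (N, d) token features; K/V sequence length is pooled by factor r,
    so the attention map is (N, N/r) instead of (N, N).
    """
    n, d = x.shape
    q = x @ wq                                      # (N, d)
    pooled = x.reshape(n // r, r, d).mean(axis=1)   # (N/r, d): average-pool tokens
    k = pooled @ wk                                 # (N/r, d)
    v = pooled @ wv                                 # (N/r, d)
    attn = softmax(q @ k.T / np.sqrt(d))            # (N, N/r) attention map
    return attn @ v                                 # (N, d)

rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = distilled_attention(x, wq, wk, wv, r=4)
print(out.shape)
```

With r = 4 the Q·Kᵀ product here costs N·(N/4)·d multiplications instead of N·N·d; the paper's 33% figure reflects its own grouping and distillation ratio, which this sketch does not reproduce.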
The study evaluates the proposed model on three commonly used facial-expression datasets: AffectNet, RAF-DB, and FERPlus. The results show that it outperforms other methods in both recognition accuracy and computational efficiency, achieving 63.58% on AffectNet, 92.78% on RAF-DB, and 92.02% on FERPlus. The model also demonstrates strong performance in feature fusion and in recognizing different facial expressions.
An ablation study evaluates the contributions of the dual-attention and interactive-learning mechanisms. The results show that adding channel-dimension self-attention and the interactive learning mechanism in parallel significantly improves classification accuracy in FER tasks. The study also examines the effect of the fusion mechanism on the attention regions for facial expressions and shows that the proposed model can effectively capture key information from different dimensions.
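The two parallel attention branches examined in the ablation can be illustrated as follows. This is a hedged sketch under assumptions: the shapes are invented for illustration, and plain addition stands in for the paper's cross-fusion step. It shows only the structural difference between the branches: spatial self-attention builds an N × N map relating token positions, while channel self-attention builds a C × C map relating feature channels.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, c = 16, 8                      # N spatial tokens, C feature channels
x = rng.standard_normal((n, c))

# Spatial branch: an N x N map relating token positions.
spatial_out = softmax(x @ x.T / np.sqrt(c)) @ x      # (N, C)

# Channel branch: a C x C map relating feature channels.
channel_out = x @ softmax(x.T @ x / np.sqrt(n))      # (N, C)

# The two parallel branches are then combined; plain addition stands in
# here for the paper's cross-fusion step.
fused = spatial_out + channel_out                    # (N, C)
print(fused.shape)
```

Because both branches produce tensors of the same (N, C) shape, any fusion rule (addition, concatenation plus projection, or the paper's cross-fusion) can be applied at this point.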
The study also analyzes the validity of the proposed activation function, showing that it is continuously differentiable at all points and involves no exponentiation, which makes it faster to evaluate than the sigmoid and tanh activation functions. The proposed activation function also performs better in FER tasks than other activation functions.
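The general idea of a C² piecewise cubic activation can be demonstrated concretely. The paper's exact construction and coefficients are not reproduced here; as a stand-in, the sketch below builds a natural cubic spline through tanh-shaped sample points, which is one standard way to obtain a piecewise cubic that is continuous up to the second derivative everywhere and requires only polynomial evaluation (no exponentiation) at inference time. The knot positions and sample values are assumptions for illustration.

```python
import numpy as np

# Hypothetical stand-in for a C^2 piecewise cubic activation: a natural
# cubic spline through tanh-shaped sample values. tanh is used only once
# here, to pick the sample values; evaluating act() needs no exponentiation.
knots = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
vals = np.tanh(knots)
h = 1.0                 # uniform knot spacing
n = len(knots)

# Natural-spline second derivatives M_i (M_0 = M_{n-1} = 0): tridiagonal system.
A = np.zeros((n - 2, n - 2))
b = np.zeros(n - 2)
for i in range(1, n - 1):
    r = i - 1
    A[r, r] = 4.0
    if r > 0:
        A[r, r - 1] = 1.0
    if r < n - 3:
        A[r, r + 1] = 1.0
    b[r] = 6.0 * (vals[i - 1] - 2.0 * vals[i] + vals[i + 1]) / h**2
M = np.zeros(n)
M[1:-1] = np.linalg.solve(A, b)

# End slopes for linear extrapolation outside the knot range; since the
# natural spline has zero second derivative at the ends, the linear
# extension keeps the function C^2 globally.
slope_lo = (vals[1] - vals[0]) / h - M[1] * h / 6.0
slope_hi = (vals[-1] - vals[-2]) / h + M[-2] * h / 6.0

def act(x):
    """Piecewise-cubic C^2 activation; linear outside [knots[0], knots[-1]]."""
    x = np.asarray(x, dtype=float)
    xi = np.clip(x, knots[0], knots[-1])
    i = np.clip(np.searchsorted(knots, xi, side="right") - 1, 0, n - 2)
    xl, xr = knots[i], knots[i + 1]
    s = (M[i] * (xr - xi) ** 3 / (6 * h) + M[i + 1] * (xi - xl) ** 3 / (6 * h)
         + (vals[i] / h - M[i] * h / 6) * (xr - xi)
         + (vals[i + 1] / h - M[i + 1] * h / 6) * (xi - xl))
    s = np.where(x > knots[-1], vals[-1] + slope_hi * (x - knots[-1]), s)
    s = np.where(x < knots[0], vals[0] + slope_lo * (x - knots[0]), s)
    return s
```

At each interior knot the second derivative equals M_i from both sides by construction, so the function is C² even though each piece is only a cubic; this is the property the paper's activation shares, obtained here with a different (spline-interpolation) construction.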
The study also evaluates the performance of self-attention distillation, showing that it reduces the computational cost of the model and improves its generalization ability, and conducts a parameter sensitivity analysis.