14 May 2024 | Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
This paper proposes a novel learning strategy to enhance the interpretability of deep facial expression recognition (FER) classifiers. The approach explicitly incorporates spatial action unit (AU) cues into the training process, enabling the development of interpretable models that align with expert knowledge. By leveraging spatial AU cues derived from facial landmarks and image-class labels, the method constructs a discriminative heatmap that highlights the regions most relevant for expression recognition. This heatmap is then used to constrain the spatial features of the classifier to be correlated with AU cues during training. A composite loss function trains the classifier to correctly classify images while generating interpretable visual attention maps aligned with AU maps, simulating the expert decision process. The proposed strategy requires no additional manual annotations and is applicable to any deep CNN or transformer-based classifier without architectural changes. Extensive experiments on two public benchmarks, RAF-DB and AffectNet, demonstrate that the proposed method improves layer-wise interpretability without degrading classification performance. Additionally, the approach enhances the interpretability of class activation maps (CAMs) used in FER. The method is generic, efficient, and provides reliable spatial cues for discriminative region localization, making it a valuable tool for interpretable FER systems.
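To make the alignment idea concrete, here is a minimal PyTorch-style sketch of the two pieces the abstract describes: building an AU heatmap from facial landmarks, and a composite loss that adds a correlation term to cross-entropy. Everything below is illustrative, not the paper's exact formulation: the function names (`landmarks_to_heatmap`, `composite_loss`), the Gaussian-blob heatmap construction, the channel-mean attention map, the cosine-similarity correlation measure, and the weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F


def landmarks_to_heatmap(landmarks, size, sigma=3.0):
    """Build a spatial AU heatmap by placing a Gaussian blob at each
    AU-relevant facial landmark (hypothetical construction).

    landmarks: (B, K, 2) float pixel coordinates (x, y) of the landmarks
               selected for each image's expression label
    size:      (H, W) of the target heatmap
    """
    H, W = size
    ys = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W)
    lx = landmarks[..., 0].view(*landmarks.shape[:2], 1, 1)
    ly = landmarks[..., 1].view(*landmarks.shape[:2], 1, 1)
    # One Gaussian per landmark; a max over landmarks keeps peaks sharp.
    blobs = torch.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2 * sigma ** 2))
    return blobs.max(dim=1).values  # (B, H, W)


def composite_loss(logits, targets, feature_maps, au_heatmaps, lam=1.0):
    """Cross-entropy plus a term correlating spatial features with AU cues.

    logits:       (B, num_classes) classifier outputs
    targets:      (B,) ground-truth expression labels
    feature_maps: (B, C, H, W) features from an intermediate layer
    au_heatmaps:  (B, H, W) AU heatmaps at the same spatial resolution
    """
    cls_loss = F.cross_entropy(logits, targets)

    # Collapse channels into one spatial attention map per image.
    attention = feature_maps.mean(dim=1)  # (B, H, W)

    # Encourage the attention map to correlate with the AU heatmap;
    # cosine similarity over flattened maps is one simple choice.
    sim = F.cosine_similarity(attention.flatten(1),
                              au_heatmaps.flatten(1), dim=1)
    align_loss = (1.0 - sim).mean()

    return cls_loss + lam * align_loss
```

Because the alignment term lives entirely in the loss, it can be attached to the features of any CNN stage or transformer block, which is consistent with the abstract's claim that the strategy requires no architectural changes.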