14 May 2024 | Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
This paper proposes a novel learning strategy to enhance the interpretability of deep facial expression recognition (FER) classifiers. The approach explicitly incorporates spatial action unit (AU) cues into the training process, enabling the development of interpretable models that align with expert knowledge. By leveraging spatial AU cues derived from facial landmarks and image-class labels, the method constructs a discriminative heatmap that highlights the regions most relevant for expression recognition. This heatmap is then used to constrain the spatial features of the classifier to be correlated with AU cues during training. A composite loss function trains the classifier to correctly classify images while generating interpretable visual attention maps aligned with AU maps, simulating the expert decision process. The proposed strategy requires no additional manual annotations and is applicable to any deep CNN or transformer-based classifier without architectural changes. Extensive experiments on two public benchmarks, RAF-DB and AffectNet, demonstrate that the proposed method improves layer-wise interpretability without degrading classification performance. Additionally, the approach enhances the interpretability of class activation maps (CAMs) used in FER. The method is generic, efficient, and provides reliable spatial cues for discriminative region localization, making it a valuable tool for interpretable FER systems.
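To make the alignment idea concrete, here is a minimal PyTorch-style sketch of the two pieces the abstract describes: building an AU heatmap from facial landmarks, and a composite loss that adds a correlation term to cross-entropy. Everything below is illustrative, not the paper's exact formulation: the function names (`landmarks_to_heatmap`, `composite_loss`), the Gaussian-blob heatmap construction, the channel-mean attention map, the cosine-similarity correlation measure, and the weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F


def landmarks_to_heatmap(landmarks, size, sigma=3.0):
    """Build a spatial AU heatmap by placing a Gaussian blob at each
    AU-relevant facial landmark (hypothetical construction).

    landmarks: (B, K, 2) float pixel coordinates (x, y) of the landmarks
               selected for each image's expression label
    size:      (H, W) of the target heatmap
    """
    H, W = size
    ys = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W)
    lx = landmarks[..., 0].view(*landmarks.shape[:2], 1, 1)
    ly = landmarks[..., 1].view(*landmarks.shape[:2], 1, 1)
    # One Gaussian per landmark; a max over landmarks keeps peaks sharp.
    blobs = torch.exp(-((xs - lx) ** 2 + (ys - ly) ** 2) / (2 * sigma ** 2))
    return blobs.max(dim=1).values  # (B, H, W)


def composite_loss(logits, targets, feature_maps, au_heatmaps, lam=1.0):
    """Cross-entropy plus a term correlating spatial features with AU cues.

    logits:       (B, num_classes) classifier outputs
    targets:      (B,) ground-truth expression labels
    feature_maps: (B, C, H, W) features from an intermediate layer
    au_heatmaps:  (B, H, W) AU heatmaps at the same spatial resolution
    """
    cls_loss = F.cross_entropy(logits, targets)

    # Collapse channels into one spatial attention map per image.
    attention = feature_maps.mean(dim=1)  # (B, H, W)

    # Encourage the attention map to correlate with the AU heatmap;
    # cosine similarity over flattened maps is one simple choice.
    sim = F.cosine_similarity(attention.flatten(1),
                              au_heatmaps.flatten(1), dim=1)
    align_loss = (1.0 - sim).mean()

    return cls_loss + lam * align_loss
```

Because the alignment term lives entirely in the loss, it can be attached to the features of any CNN stage or transformer block, which is consistent with the abstract's claim that the strategy requires no architectural changes.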