2024-09-08 | Li Lin, Sarah Papabathini, Xin Wang, Shu Hu
This paper introduces a lightweight framework for robust facial affective behavior recognition that pairs a pretrained CLIP image encoder with a trainable multilayer perceptron (MLP), enhanced with Conditional Value at Risk (CVaR) for robustness and a loss-landscape flattening strategy for improved generalization. The framework handles both expression classification and action unit (AU) detection: the CLIP ViT-L/14 encoder extracts high-level facial features, while the MLP head is trained for the specific task, with the output layer configured accordingly. CVaR is integrated into the loss functions to improve accuracy and reliability in challenging scenarios, including imbalanced data and domain shifts, and the optimization component smooths the loss landscape to enhance generalization. Experiments on the Aff-Wild2 dataset show superior performance over the baseline at minimal computational cost: an 11% improvement in expression classification and a 4% improvement in AU detection. The contributions include the first lightweight framework for both expression classification and AU detection, the integration of CVaR into the loss functions, and superior performance on Aff-Wild2. The method is efficient, lightweight, and robust, making it suitable for real-world applications. The code is available at https://github.com/Purdue-M2/Affective_Behavior_Analysis_M2_PURDUE.
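The abstract does not spell out the exact CVaR formulation used in the loss. A minimal sketch of one common empirical version, where CVaR at level α is the mean of the worst α-fraction of per-sample losses (function and parameter names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def cvar_loss(per_sample_losses, alpha=0.3):
    """Empirical CVaR at level alpha: the mean of the worst
    alpha-fraction of per-sample losses. Replacing a plain mean
    reduction with this focuses training on hard/tail examples."""
    losses = np.sort(np.asarray(per_sample_losses, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * losses.size)))  # size of the tail
    return losses[:k].mean()
```

For example, with losses `[1, 2, 3, 4]` and `alpha=0.5`, only the two largest losses (4 and 3) contribute, so the objective is 3.5 rather than the plain mean of 2.5; setting `alpha=1.0` recovers the ordinary mean.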
The experiments demonstrate the effectiveness of the proposed method in accurately classifying expressions and detecting action units.
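Loss-landscape flattening is commonly realized with a sharpness-aware update: evaluate the gradient at weights perturbed toward the local worst case, then apply that gradient at the original weights. A hedged numpy sketch of one such step (hyperparameter names `lr` and `rho` are illustrative; this is a generic sharpness-aware scheme, not necessarily the authors' exact procedure):

```python
import numpy as np

def flat_minima_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware update toward a flatter minimum.

    w       : current weight vector (numpy array)
    grad_fn : callable returning the loss gradient at a weight vector
    rho     : radius of the adversarial weight perturbation
    """
    g = grad_fn(w)
    norm = np.linalg.norm(g) + 1e-12       # avoid division by zero
    w_adv = w + rho * g / norm             # ascend to the local worst case
    g_adv = grad_fn(w_adv)                 # gradient at the perturbed point
    return w - lr * g_adv                  # descend using the worst-case gradient
```

On the toy loss L(w) = w², with `w = 1.0`, `lr = 0.1`, `rho = 0.05`, the perturbed point is 1.05 and the update lands at 1 − 0.1·2.1 = 0.79, slightly more conservative than the plain gradient step to 0.80, reflecting the flatter-minimum bias.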