Haodong Chen, Haojian Huang, Junhao Dong, Mingzhe Zheng, and Dian Shao | October 28–November 1, 2024, Melbourne, VIC, Australia
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior, but current methods struggle with noisy data, underused facial dynamics, and ambiguous expression semantics. To address these issues, the authors propose FineCLIPER (Multi-modal Fine-grained CLIP for DFER with AdaptERs), a novel framework whose key innovations include:
1. **Label Augmentation**: Extending class labels into fine-grained textual descriptions (both positive and negative) and embedding them in the CLIP model's cross-modal latent space to better distinguish similar facial expressions (see the first sketch after this list).
2. **Hierarchical Information Mining**: Mining useful cues from dynamic facial expression (DFE) videos at three semantic levels (see the second sketch after this list):
- **Low Semantic Level**: Directly embedding video frames.
- **Middle Semantic Level**: Extracting face segmentation masks and landmarks.
- **High Semantic Level**: Generating detailed descriptions of facial changes across frames using a Multi-modal Large Language Model (MLLM).
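To make the label-augmentation idea concrete, here is a minimal sketch that scores a video frame against a positive and a negative class description in CLIP's shared latent space, using the Hugging Face `transformers` CLIP API. The descriptions, the checkpoint name, and the dummy frame are illustrative assumptions, not the paper's actual prompts or pipeline:

```python
# A minimal sketch of CLIP-based label augmentation (illustrative prompts,
# not the paper's actual descriptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical fine-grained descriptions for one class ("happiness"):
pos_text = "a face with raised lip corners, crinkled eyes, and lifted cheeks"
neg_text = "a face with lowered brows, tightened lips, and a clenched jaw"

frame = Image.new("RGB", (224, 224))  # stand-in for a sampled video frame
inputs = processor(text=[pos_text, neg_text], images=frame,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# Higher similarity to the positive description supports the class;
# similarity to the negative description is suppressed during training.
probs = out.logits_per_image.softmax(dim=-1)  # shape (1, 2): [p_pos, p_neg]
print(probs)
```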
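And here is a minimal sketch of how pooled features from the three semantic levels might be fused into a single video-level representation. The shared dimension, the per-level projections, and the attention-based fusion are assumptions for illustration, not FineCLIPER's exact architecture:

```python
# A minimal sketch of fusing low/mid/high semantic-level features
# (dimensions and fusion operator are assumptions).
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Project each level into a shared space, then attend across levels.
        self.proj_low = nn.Linear(dim, dim)   # frame embeddings
        self.proj_mid = nn.Linear(dim, dim)   # mask/landmark embeddings
        self.proj_high = nn.Linear(dim, dim)  # MLLM description embeddings
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, low, mid, high):
        # Each input: (batch, dim) pooled features from one semantic level.
        tokens = torch.stack([self.proj_low(low),
                              self.proj_mid(mid),
                              self.proj_high(high)], dim=1)  # (B, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)  # (B, dim) video-level representation

b, d = 2, 512
fusion = HierarchicalFusion(d)
video_feat = fusion(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
print(video_feat.shape)  # torch.Size([2, 512])
```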
Additionally, FineCLIPER employs Parameter-Efficient Fine-Tuning (PEFT) to adapt large pre-trained models such as CLIP at low cost (a minimal adapter sketch appears after this paragraph). The framework achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Extensive experiments and ablation studies validate the effectiveness of each component and demonstrate FineCLIPER's superiority over existing methods.
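As a rough illustration of adapter-style PEFT, the sketch below adds a standard bottleneck adapter residually on top of frozen backbone features; the layer sizes, placement, and zero-initialized up-projection are common conventions, not necessarily the paper's exact design:

```python
# A minimal sketch of a bottleneck adapter for parameter-efficient
# fine-tuning (sizes and placement are assumptions).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity (residual adds 0)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adapter

# During training, the pre-trained backbone stays frozen and only the
# adapters are updated, e.g.:
#   for p in clip_model.parameters(): p.requires_grad = False
#   for p in adapter.parameters():    p.requires_grad = True
x = torch.randn(2, 197, 768)  # e.g. a ViT token sequence
print(Adapter()(x).shape)     # torch.Size([2, 197, 768])
```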