FineCLIPER is a novel framework for Dynamic Facial Expression Recognition (DFER) that incorporates multi-modal fine-grained CLIP with AdaptERs. It addresses the challenges of ambiguous facial expression semantics and subtle facial movements by extending class labels into textual descriptions covering both positive and negative aspects, and by using cross-modal similarity based on the CLIP model for supervision.

The framework employs a hierarchical approach to mine useful cues from DFE videos: low-level video frames, middle-level face segmentation masks and landmarks, and high-level descriptions generated by a multi-modal large language model (MLLM). It also applies Parameter-Efficient Fine-Tuning (PEFT) to adapt large pre-trained models with few tunable parameters.

FineCLIPER achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings. Its key innovations are label augmentation with positive-negative (PN) descriptors and a semantically hierarchical strategy for information mining. Results show significant gains, particularly in the "Disgust" category of DFEW, and leveraging video captions for pretraining makes the framework effective in zero-shot settings. Extensive experiments and ablation studies validate the proposed strategies.
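The PN-descriptor supervision can be illustrated with a minimal sketch: a video embedding is compared against text embeddings of positive and negative class descriptions via cosine similarity, with a cross-entropy term over the positives and a penalty on similarity to negatives. The function names, the temperature value, and the exact loss form are illustrative assumptions, not FineCLIPER's actual objective.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize embeddings so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pn_similarity_logits(video_emb, pos_text_embs, neg_text_embs, temperature=0.07):
    """Cosine similarity of one video embedding against per-class positive and
    negative textual descriptions, scaled by a CLIP-style temperature."""
    v = l2_normalize(video_emb)
    pos_logits = l2_normalize(pos_text_embs) @ v / temperature
    neg_logits = l2_normalize(neg_text_embs) @ v / temperature
    return pos_logits, neg_logits

def pn_loss(pos_logits, neg_logits, label):
    """Cross-entropy over positive-description logits for the true class,
    plus a softplus penalty pushing down similarity to negative descriptions.
    (Illustrative combination; the paper's exact formulation may differ.)"""
    z = pos_logits - pos_logits.max()               # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[label]
    neg_penalty = np.logaddexp(0.0, neg_logits).mean()
    return ce + neg_penalty
```

With this supervision, the class whose positive description best matches the video gets the highest logit, while negative descriptions of all classes are suppressed.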
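The PEFT idea can likewise be sketched with a generic bottleneck adapter: a small down-project/up-project pair with a residual connection is inserted into a frozen backbone, so only the tiny adapter matrices are trained. This is a common adapter design, not necessarily FineCLIPER's exact module; all names here are illustrative.

```python
import numpy as np

def make_adapter(dim, bottleneck, rng):
    """Bottleneck adapter parameters: down-projection, nonlinearity, up-projection.
    Zero-initializing the up-projection makes the adapter start as an identity map,
    so the frozen backbone's behavior is unchanged before training."""
    return {
        "down": rng.normal(scale=0.02, size=(dim, bottleneck)),
        "up": np.zeros((bottleneck, dim)),
    }

def adapter_forward(x, params):
    """Residual bottleneck: x + ReLU(x @ down) @ up."""
    h = np.maximum(x @ params["down"], 0.0)
    return x + h @ params["up"]

def adapter_param_count(dim, bottleneck):
    """Tunable parameters per adapter vs. a full dim x dim layer."""
    return 2 * dim * bottleneck
```

With `dim=768` and `bottleneck=16`, an adapter adds roughly 24k tunable parameters per layer, versus ~590k for a full projection, which is why PEFT keeps the tunable-parameter count small.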