2 Nov 2024 | Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander G. Hauptmann
The paper introduces Emotion-LLaMA, a novel multimodal large language model designed to accurately recognize and interpret human emotions in real-world scenarios. The model addresses the limitations of traditional single-modality approaches and existing Multimodal Large Language Models (MLLMs) by integrating audio, visual, and textual inputs through emotion-specific encoders. The key contributions of the work include:
1. **MERR Dataset**: A comprehensive dataset containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. The dataset enables models to learn from varied scenarios and generalize to real-world applications.
2. **Emotion-LLaMA Model**: This model integrates HuBERT for audio processing and multi-view visual encoders (MAE, VideoMAE, EVA) to capture facial details, dynamics, and context. By aligning these features into a shared space and instruction-tuning a modified LLaMA model, Emotion-LLaMA enhances both emotion recognition and reasoning capabilities (a minimal fusion sketch appears after this list).
3. **Instruction Tuning**: The model is trained with a multi-task scheme: pre-training on the MERR dataset followed by fine-tuning on target datasets such as MER2023 and DFEW. This ensures the model not only identifies emotions accurately but also understands the context and reasoning behind each emotion (an illustrative training-sample format follows this list).
4. **Performance Evaluation**: Extensive evaluations show that Emotion-LLaMA outperforms other MLLMs across multiple benchmarks, achieving top scores for Clue Overlap (7.83) and Label Overlap (6.25), an F1 score of 0.9036 on MER2023-SEMI, and a UAR of 45.59 and WAR of 59.37 on the DFEW dataset (both metrics are defined in the short sketch after this list).
5. **Qualitative Analysis**: The paper provides detailed qualitative comparisons of emotion reasoning results, demonstrating Emotion-LLaMA's superior ability to integrate information from multiple modalities and capture subtle emotional cues.
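To make the feature alignment described in item 2 concrete, here is a minimal sketch of one common way such fusion is implemented: each modality's encoder output is linearly projected into the LLM's token-embedding space and concatenated with the text embeddings. The class name, feature dimensions, and single-linear-layer projections are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative fusion module: maps audio and visual features into the
    LLM's token-embedding space so they can be interleaved with text tokens.
    Dimensions are placeholders, not the paper's actual configuration."""

    def __init__(self, audio_dim=1024, visual_dim=768, llm_dim=4096):
        super().__init__()
        # One linear projection per modality, as in many MLLM designs.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.visual_proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats:  (B, Ta, audio_dim), e.g. pooled HuBERT frames
        # visual_feats: (B, Tv, visual_dim), e.g. stacked MAE/VideoMAE/EVA tokens
        # text_embeds:  (B, Tt, llm_dim), embeddings of the instruction prompt
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Prepend the multimodal tokens to the text sequence; the LLM then
        # attends over all modalities jointly.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)

if __name__ == "__main__":
    proj = MultimodalProjector()
    fused = proj(torch.randn(2, 8, 1024), torch.randn(2, 16, 768), torch.randn(2, 32, 4096))
    print(fused.shape)  # torch.Size([2, 56, 4096])
```

A common design choice with this layout is to freeze the pretrained encoders and train only the projections plus lightweight adapters inside the LLM, which keeps instruction tuning inexpensive.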
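The multi-task instruction tuning in item 3 can be pictured with a single training sample. The field names, file paths, and prompt wording below are hypothetical placeholders showing the general structure of an instruction-response pair, not the actual MERR schema.

```python
# Hypothetical shape of one instruction-tuning sample; the real MERR schema
# and prompt wording may differ.
merr_sample = {
    "video": "samples/clip_00042.mp4",   # placeholder path
    "audio": "samples/clip_00042.wav",   # placeholder path
    "instruction": (
        "Describe the person's emotional state and explain the audio, "
        "visual, and textual cues that support it."
    ),
    "response": (
        "The speaker appears frustrated: the furrowed brow and rising voice "
        "pitch, together with the dismissive wording, point to anger."
    ),
    "coarse_label": "angry",
}
```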
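For the DFEW numbers in item 4, UAR is the unweighted average recall (the mean of per-class recalls) and WAR is the weighted average recall (overall accuracy). The helper below computes both; the function name and toy labels are illustrative.

```python
import numpy as np

def uar_war(y_true, y_pred, num_classes):
    """Unweighted Average Recall (mean per-class recall) and
    Weighted Average Recall (overall accuracy)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    uar = float(np.mean(recalls))
    war = float((y_pred == y_true).mean())
    return uar, war

# Toy example with 3 classes:
# uar_war([0, 0, 1, 2], [0, 1, 1, 2], 3) -> (0.833..., 0.75)
```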
Overall, Emotion-LLaMA represents a significant advancement in the field of multimodal emotion recognition and reasoning, offering a robust and versatile solution for real-world applications.