10 Apr 2024 | Alexandros Xenos*, Niki M. Foteinopoulou*, Ioanna Ntinou*, Ioannis Patras, and Georgios Tzimiropoulos
This paper proposes a novel approach to in-context emotion recognition using Vision-and-Large-Language Models (VLLMs). The method involves two stages: first, generating natural language descriptions of the subject's apparent emotion relative to the visual context using a VLLM such as LLaVA-1.5; second, using these descriptions as contextual information to train a transformer-based architecture that fuses text and visual features for the final emotion classification. The approach leverages the common-sense reasoning capabilities of VLLMs to enhance emotion recognition without adding complexity to the training process. Experimental results show that the fused architecture significantly outperforms either modality alone and achieves state-of-the-art results on three datasets: EMOTIC, CAER-S, and BoLD. The method demonstrates that text and image features carry complementary information, and that fusing them improves emotion recognition performance.
The code is publicly available on GitHub.
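The two-stage pipeline summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoders are placeholder functions standing in for LLaVA-1.5 description generation and a visual backbone, and the transformer fusion module is reduced here to simple feature concatenation with a linear classifier head.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(description: str, dim: int = 8) -> np.ndarray:
    # Placeholder for a text encoder over the VLLM-generated emotion
    # description (stage 1 of the pipeline). Here: a deterministic
    # pseudo-random embedding seeded by the string, for illustration only.
    seed = abs(hash(description)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def encode_image(image: np.ndarray, dim: int = 8) -> np.ndarray:
    # Placeholder for a visual feature extractor; here a fixed random
    # projection of the flattened image.
    proj = rng.standard_normal((image.size, dim))
    return image.flatten() @ proj

def fuse_and_classify(text_feat: np.ndarray, img_feat: np.ndarray,
                      num_classes: int = 3) -> int:
    # Stage 2: fuse the two modalities and classify. The paper uses a
    # transformer-based fusion module; this sketch substitutes late fusion
    # by concatenation followed by a linear head.
    fused = np.concatenate([text_feat, img_feat])
    weights = rng.standard_normal((fused.size, num_classes))
    logits = fused @ weights
    return int(np.argmax(logits))

# Hypothetical inputs: a VLLM-generated description and a dummy image.
description = "The subject appears joyful amid a celebratory crowd."
image = rng.standard_normal((4, 4))
label = fuse_and_classify(encode_text(description), encode_image(image))
print(label)  # an index into the emotion label set
```

The sketch only conveys the data flow: a text description produced in stage 1 is embedded and combined with image features in stage 2, so the classifier can exploit the complementary information the paper reports.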