10 Apr 2024 | Alexandros Xenos*1, Niki M. Foteinopoulou*1, Ioanna Ntinou*1, Ioannis Patras1, and Georgios Tzimiropoulos1
This paper addresses the challenge of in-context emotion recognition by leveraging the capabilities of Vision-and-Large-Language Models (VLLMs). The authors propose a two-stage approach: first, they use a VLLM to generate natural language descriptions of the subject's emotional state and context; second, they train a transformer-based architecture that fuses visual and text features to perform emotion classification. The method is evaluated on three datasets (EMOTIC, CAER-S, and BoLD) and shows superior performance compared to individual modalities and previous state-of-the-art methods. The key contributions include a novel approach to incorporating context through VLLMs and a multi-modal architecture that effectively combines visual and text information. The code for this method is available on GitHub.
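The paper and its GitHub repository give the actual architectural details; as a rough illustration only, the sketch below shows one plausible way the second-stage fusion head could look in PyTorch. A transformer encoder ingests projected visual tokens together with tokens encoding the VLLM-generated description, and a classification head reads out the fused representation. All names (`FusionClassifier`, `d_model`, the dummy feature tensors) are hypothetical and not taken from the authors' code.

```python
# Hypothetical sketch of the two-stage pipeline's second stage: stage 1 (not
# shown) uses a VLLM to produce a textual description of the subject and its
# context; stage 2 fuses visual and text embeddings with a small transformer
# and classifies emotions. Module/variable names are illustrative only.
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    def __init__(self, d_visual=768, d_text=768, d_model=512,
                 n_heads=8, n_layers=2, n_classes=26):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.vis_proj = nn.Linear(d_visual, d_model)
        self.txt_proj = nn.Linear(d_text, d_model)
        # Learnable [CLS] token whose final state feeds the classifier.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, d_visual), e.g. image patch features
        # text_tokens:   (B, Nt, d_text),   e.g. encoded VLLM description
        B = visual_tokens.size(0)
        tokens = torch.cat([
            self.cls_token.expand(B, -1, -1),
            self.vis_proj(visual_tokens),
            self.txt_proj(text_tokens),
        ], dim=1)
        fused = self.encoder(tokens)
        return self.head(fused[:, 0])  # logits from the [CLS] position


if __name__ == "__main__":
    # Dummy tensors standing in for real visual/text features.
    model = FusionClassifier()
    vis = torch.randn(2, 50, 768)
    txt = torch.randn(2, 32, 768)
    print(model(vis, txt).shape)  # -> torch.Size([2, 26])
```

The choice of a shared [CLS] token over two separate modality heads is one common fusion pattern; the actual model, loss, and label space (e.g. the number of emotion categories per dataset) are as defined in the paper and released code.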