12 Jun 2024 | Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord
This paper presents a novel framework for interpreting large multimodal models (LMMs), which combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. The authors propose a dictionary learning-based approach to extract "multimodal concepts" from the internal representations of LMMs. These concepts are semantically grounded in both visual and textual domains, allowing for qualitative and quantitative evaluations of their effectiveness. The method is evaluated on the DePALM model trained for captioning tasks on the COCO dataset, and the results show that the learned concepts are useful for interpreting test sample representations. The paper also discusses the overlap and disentanglement of learned concepts and provides visual and textual grounding examples. The authors conclude by highlighting the limitations of their method and its broader societal impact, emphasizing the need for further research in this area.
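To make the dictionary-learning idea concrete, here is a minimal sketch of how internal representations could be decomposed into a small set of concept directions and per-sample activations. This is not the authors' exact implementation; the matrix `Z`, the variable names, and the choice of scikit-learn's `DictionaryLearning` with a sparsity penalty are illustrative assumptions.

```python
# Sketch: decompose a matrix of LMM internal representations Z (n_samples x d)
# into a concept dictionary U and sparse activations V, so that Z ~ V @ U.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 768))   # placeholder for extracted LMM token representations

n_concepts = 20                   # assumed dictionary size (number of concepts)
dl = DictionaryLearning(
    n_components=n_concepts,
    alpha=1.0,                    # sparsity strength on the concept activations
    transform_algorithm="lasso_lars",
    random_state=0,
)
V = dl.fit_transform(Z)           # (500, 20) concept activations per sample
U = dl.components_                # (20, 768) concept directions in representation space

# A test sample can then be interpreted via its strongest-activating concepts,
# which in the paper's setting are grounded visually and textually.
top_concepts = np.argsort(-np.abs(V), axis=1)[:, :3]
```

In this framing, each row of `U` is a candidate "multimodal concept", and grounding would amount to inspecting the images and generated tokens that most strongly activate it.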