Image Fusion via Vision-Language Model

2024 | Zixiang Zhao¹ ², Lilun Deng¹, Haowen Bai¹, Yukun Cui¹, Zhipeng Zhang² ³, Yulun Zhang⁴, Haotong Qin², Dongdong Chen⁵, Jiangshe Zhang¹, Peng Wang³, Luc Van Gool¹ ² ⁶ ⁷
The paper introduces a novel image fusion paradigm called *Image Fusion via Vision-Language Model* (FILM), which leverages explicit textual information from source images to guide the fusion process. FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions, which are then fused within the textual domain and used to guide the extraction and fusion of visual features via cross-attention. This approach enhances feature extraction and contextual understanding, directed by textual semantic information. The method has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. Additionally, the paper proposes a vision-language dataset containing ChatGPT-generated paragraph descriptions for eight image fusion datasets, facilitating future research in vision-language model-based image fusion. The code and dataset are available at <https://github.com/Zhaozixiang1228/IF-FILM>.
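The core mechanism described above — visual features attending to fused text embeddings — can be illustrated with a minimal cross-attention sketch. This is not the authors' implementation: the function name, the flat feature shapes, and the omission of learned query/key/value projections and multi-head splitting are all simplifications assumed here for clarity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(img_feats, text_feats):
    """Hypothetical sketch: image patch features attend to text-token embeddings.

    img_feats:  (N, d) array of N visual patch features.
    text_feats: (M, d) array of M text-token embeddings (e.g. from a
                ChatGPT-generated description, after an embedding model).
    Returns text-guided visual features of shape (N, d).
    """
    d = img_feats.shape[-1]
    # Queries from the image, keys/values from the text (no learned
    # projections in this sketch, unlike a real cross-attention layer).
    scores = img_feats @ text_feats.T / np.sqrt(d)   # (N, M) similarity
    attn = softmax(scores, axis=-1)                  # each patch weights the tokens
    return attn @ text_feats                         # (N, d) guided features


# Toy usage: 4 patch features and 3 text tokens, both 8-dimensional.
guided = cross_attention(np.random.rand(4, 8), np.random.rand(3, 8))
```

In FILM these guided features would then be merged with the original visual features before decoding the fused image; here the sketch only shows the attention step itself.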