This paper introduces a novel image fusion approach called FILM, which integrates vision-language models (VLMs) to enhance the fusion process with textual semantic information. FILM generates semantic prompts from the source images and uses them to guide the fusion of visual features, improving feature extraction and contextual understanding through cross-attention mechanisms. The method is evaluated on four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. The paper also introduces a new vision-language benchmark dataset containing ChatGPT-generated descriptions for these tasks, facilitating future research in VLM-based image fusion. The workflow of FILM comprises text feature fusion, language-guided vision feature fusion, and vision feature decoding, with the text features guiding the extraction and fusion of the visual features. The network is trained with a combination of loss functions, including a perceptual loss and a structural similarity index (SSIM) loss. Across the four tasks, FILM achieves state-of-the-art performance, outperforming existing methods in both visual quality and quantitative metrics and demonstrating the value of deeper textual semantic information for more accurate, context-aware image fusion.
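The summary above mentions two concrete mechanisms: text features steering visual feature fusion through cross-attention, and a training objective combining a perceptual loss with an SSIM loss. The PyTorch sketch below illustrates what such a language-guided fusion block and combined loss could look like; the module layout, the `LanguageGuidedFusion` name, the feature dimensions, and the loss weighting are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of text-guided cross-attention fusion and a perceptual + SSIM
# objective, in the spirit of FILM. All names, shapes, and weights below are
# assumed for illustration.
import torch
import torch.nn as nn


class LanguageGuidedFusion(nn.Module):
    """Fuses visual tokens from two source images, conditioning each stream
    on text tokens via cross-attention (hypothetical layout)."""

    def __init__(self, vis_dim=256, txt_dim=512, heads=8):
        super().__init__()
        # Visual tokens act as queries; text tokens from the VLM description
        # serve as keys and values.
        self.attn = nn.MultiheadAttention(embed_dim=vis_dim, num_heads=heads,
                                          kdim=txt_dim, vdim=txt_dim,
                                          batch_first=True)
        self.fuse = nn.Linear(2 * vis_dim, vis_dim)

    def forward(self, feat_a, feat_b, txt_emb):
        # feat_a, feat_b: (B, N, vis_dim) visual tokens from the two source images
        # txt_emb:        (B, T, txt_dim) embedding of the generated text description
        guided_a, _ = self.attn(feat_a, txt_emb, txt_emb)  # text-conditioned stream A
        guided_b, _ = self.attn(feat_b, txt_emb, txt_emb)  # text-conditioned stream B
        # Concatenate the two guided streams and project back to vis_dim,
        # keeping the spatial token count N for the vision feature decoder.
        return self.fuse(torch.cat([guided_a, guided_b], dim=-1))


def fusion_loss(fused, src_a, src_b, perceptual_fn, ssim_fn, w_ssim=0.5):
    """Combined objective: perceptual loss plus (1 - SSIM) against both sources.
    perceptual_fn and ssim_fn are placeholders (e.g. a VGG feature distance and
    an SSIM implementation); the 0.5 weight is an assumed value."""
    perc = perceptual_fn(fused, src_a) + perceptual_fn(fused, src_b)
    struct = (1 - ssim_fn(fused, src_a)) + (1 - ssim_fn(fused, src_b))
    return perc + w_ssim * struct
```

Using the visual tokens as queries (rather than the text) preserves the spatial token count, so the fused features can be passed directly to a decoder that reconstructs the fused image; this is one plausible design choice consistent with the workflow described above, not necessarily the one used in the paper.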