4 Jun 2024 | Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li
This paper introduces a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). In the proposed framework, LLaVA analyzes input images and generates textual prompts, which are then fed into the image-to-image generation pipeline. These enriched prompts guide the generation process to produce outputs that more closely resemble the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in improving visual coherence between the generated and input images. The approach aims to balance faithfulness to the original image with artistic expression in the generated outputs. Future work will focus on fine-tuning LLaVA prompts to enhance control over the creative process.