Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt

2024 | Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li
This paper presents a novel approach to enhancing image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). The proposed framework uses LLaVA to analyze the input image and produce textual descriptions, referred to as LLaVA-generated prompts. These prompts, together with the original image, are fed into the image-to-image generation pipeline, and the enriched representation guides the generation process toward outputs that more closely resemble the input image.

The framework couples LLaVA's image-understanding capabilities with Stable Diffusion's image-generation capabilities. From the input image, LLaVA generates both positive and negative prompts: the positive prompts capture the essence of the input image, while the negative prompts help suppress unintended visual elements. The generated prompts are then combined with the input image and passed to the image-to-image generation model, which produces a new image that closely resembles the original while incorporating the specified modifications.

Extensive experiments evaluate the impact of LLaVA-generated prompts on the quality and similarity of image-to-image generation. Compared with traditional prompting methods, the LLaVA-generated prompts yield a significant improvement in visual coherence between the generated and input images across a variety of scenarios. The paper concludes that the proposed framework produces outputs with a higher degree of similarity to the user's initial input, and future work will explore fine-tuning LLaVA prompts for increased control over the creative process.
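The following is a minimal sketch of this kind of pipeline, assuming LLaVA-1.5 loaded through Hugging Face transformers and a Stable Diffusion img2img pipeline from diffusers. The model checkpoints, prompt wording, and hyperparameters (strength, guidance scale) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: LLaVA-guided image-to-image generation with positive and negative
# prompts. Checkpoints, prompt templates, and hyperparameters are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load LLaVA to describe the input image (assumed checkpoint).
llava_id = "llava-hf/llava-1.5-7b-hf"
llava_processor = AutoProcessor.from_pretrained(llava_id)
llava = LlavaForConditionalGeneration.from_pretrained(
    llava_id, torch_dtype=torch.float16
).to(device)


def llava_describe(image: Image.Image, instruction: str) -> str:
    """Ask LLaVA a question about the image and return its answer."""
    prompt = f"USER: <image>\n{instruction} ASSISTANT:"
    inputs = llava_processor(text=prompt, images=image, return_tensors="pt").to(
        device, torch.float16
    )
    output_ids = llava.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = llava_processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the assistant's answer.
    return text.split("ASSISTANT:")[-1].strip()


init_image = Image.open("input.jpg").convert("RGB").resize((512, 512))

# Positive prompt: capture the essence of the input image.
positive_prompt = llava_describe(
    init_image, "Describe this image in one detailed sentence."
)
# Negative prompt: elements to keep out of the output (instruction wording
# is an assumption).
negative_prompt = llava_describe(
    init_image,
    "List visual elements, styles, or artifacts that should be avoided "
    "when re-creating this image, as a comma-separated list.",
)

# 2. Feed both prompts and the original image into a Stable Diffusion
#    img2img pipeline (assumed checkpoint and settings).
sd = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

result = sd(
    prompt=positive_prompt,
    negative_prompt=negative_prompt,
    image=init_image,
    strength=0.6,        # how far the output may drift from the input image
    guidance_scale=7.5,  # how strongly the prompts steer generation
).images[0]
result.save("output.jpg")
```

In this sketch the negative prompt is obtained by asking LLaVA what to avoid; other realizations could use a fixed negative-prompt list instead, but the key idea from the paper, feeding LLaVA-derived positive and negative prompts alongside the original image, is the same.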