3 Jan 2024 | Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia
Instruct-Imagen is a model that introduces multi-modal instructions for image generation, enabling a single model to handle a wide range of image generation tasks and to generalize to unseen ones. A multi-modal instruction uses natural language to compose inputs from different modalities, such as text, edge maps, style references, and subject images, so that a single prompt can express complex transformations, including combinations the model has never observed before.
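To make the instruction format concrete, here is a minimal sketch in Python of how such an instruction might be represented. The class, field names, placeholder syntax, and file paths are hypothetical illustrations; the paper does not prescribe a specific data format.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MultiModalInstruction:
    """Hypothetical container for a multi-modal instruction.

    The natural-language prompt refers to visual conditions through named
    placeholders (e.g. "[style]", "[subject]"); `contexts` maps each
    placeholder to the image that supplies that condition.
    """
    prompt: str
    contexts: Dict[str, str] = field(default_factory=dict)

# Example: a composed task (a specific subject rendered in a reference style)
# that may never have appeared as a single training task.
instruction = MultiModalInstruction(
    prompt=("Render the dog shown in [subject] in the painting style of "
            "[style], standing on a beach at sunset."),
    contexts={
        "[subject]": "examples/my_dog.jpg",    # subject reference photo
        "[style]": "examples/watercolor.jpg",  # style reference image
    },
)
```

The point of interleaving language with named visual conditions is that the same format covers text-to-image, subject-driven, style-conditioned, and edge-conditioned generation, as well as compositions of these.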
The model is trained in two stages. First, a pre-trained text-to-image diffusion model is adapted with retrieval-augmented training: its text-to-image training is continued, but each example is supplemented with similar (image, text) contexts retrieved from a web-scale (image, text) corpus, which enhances its ability to ground generation on external multi-modal context. Second, the model is fine-tuned on diverse image generation tasks that require vision-language understanding, each paired with a multi-modal instruction that encapsulates the task's essence.

Human evaluation on a range of image generation datasets shows that Instruct-Imagen matches or surpasses prior task-specific models in their own domains, performing strongly in both in-domain and zero-shot settings. As a unified model, it tackles heterogeneous image generation tasks without any ad hoc, task-specific design, and it generalizes promisingly to unseen and more complex tasks: because it understands and follows multi-modal instructions, it can generate images for instruction combinations that were never observed during training.
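The two-stage recipe described above can be sketched as a simple training schedule. The toy PyTorch skeleton below is purely illustrative: ToyDiffusionModel, the simplified denoising loss, and the random tensors standing in for image, text, and context embeddings are hypothetical placeholders, not the paper's architecture or data. Only the shape of the schedule (retrieval-augmented continuation, then multi-modal instruction tuning) reflects the description above.

```python
import torch
import torch.nn as nn

class ToyDiffusionModel(nn.Module):
    """Stand-in for a text-to-image diffusion backbone that also accepts
    extra multi-modal context (retrieved pairs or condition images)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        # [noisy image | text | context] -> noise estimate
        self.denoiser = nn.Linear(dim * 3, dim)

    def forward(self, noisy, text_emb, context_emb):
        return self.denoiser(torch.cat([noisy, text_emb, context_emb], dim=-1))

def diffusion_loss(model, target, text_emb, context_emb):
    """Simplified denoising objective: predict the injected noise."""
    noise = torch.randn_like(target)
    pred = model(target + noise, text_emb, context_emb)
    return ((pred - noise) ** 2).mean()

dim = 32
model = ToyDiffusionModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stage 1: retrieval-augmented continued training on (image, text) pairs,
# each supplemented with retrieved neighbor pairs as external context.
for _ in range(100):
    image, caption = torch.randn(1, dim), torch.randn(1, dim)  # fake embeddings
    neighbors = torch.randn(1, dim)                            # "retrieved" context
    loss = diffusion_loss(model, image, caption, neighbors)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: instruction tuning on a mixture of tasks, where the text input is
# the multi-modal instruction and the context carries the condition images
# (edge map, style reference, subject photos, ...).
for _ in range(100):
    target, instruction = torch.randn(1, dim), torch.randn(1, dim)
    condition_images = torch.randn(1, dim)
    loss = diffusion_loss(model, target, instruction, condition_images)
    opt.zero_grad(); loss.backward(); opt.step()
```

In this reading, stage 1 teaches the backbone to use extra visual context at all, and stage 2 only changes what that context and the accompanying text mean, which is why the same conditioning pathway can generalize to instruction combinations not seen during fine-tuning.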