3 Jan 2024 | Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhua Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia
This paper introduces Instruct-Imagen, a model designed to handle heterogeneous image generation tasks and generalize to unseen tasks. The key innovation is the introduction of multi-modal instructions, which integrate various modalities (e.g., text, edge, style, subject) to express complex generation intentions in a uniform format. The model is trained in two stages: first, retrieval-augmented training enhances its ability to ground generation on external multi-modal context; second, it is fine-tuned on diverse image generation tasks paired with multi-modal instructions. Human evaluations on various datasets show that Instruct-Imagen matches or surpasses prior task-specific models in their domains and demonstrates promising generalization to unseen and more complex tasks. The paper also discusses the model's architecture, training details, and limitations, highlighting its potential for future research and applications.
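To make the idea of a multi-modal instruction concrete, here is a minimal sketch of how such an instruction could be represented as data: a text prompt whose markers refer to accompanying context images tagged by modality. All names here (`MultiModalInstruction`, `ContextImage`, the marker syntax) are hypothetical illustrations, not the paper's actual interface or code.

```python
# Illustrative sketch only; names and structure are assumptions, not the paper's API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextImage:
    """One piece of multi-modal context (e.g. a style reference or an edge map)."""
    modality: str    # e.g. "style", "subject", "edge", "mask"
    image_path: str  # reference image supplying that modality


@dataclass
class MultiModalInstruction:
    """A text instruction whose markers ([subject], [style], ...) point at context images."""
    text: str
    context: List[ContextImage] = field(default_factory=list)

    def modalities(self) -> Dict[str, str]:
        # Map each modality tag to the image that grounds it.
        return {c.modality: c.image_path for c in self.context}


# Example: a single uniform instruction combining a subject image and a style image.
instruction = MultiModalInstruction(
    text="Render the [subject] dog in the [style] of the reference painting.",
    context=[
        ContextImage(modality="subject", image_path="dog.jpg"),
        ContextImage(modality="style", image_path="van_gogh.jpg"),
    ],
)
print(instruction.modalities())
```

The point of such a format is that very different tasks (subject-driven generation, style transfer, edge-conditioned generation) all reduce to the same instruction structure, which is what lets a single model be fine-tuned across them and generalize to new combinations.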