InstanceDiffusion: Instance-level Control for Image Generation

5 Feb 2024 | Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra
InstanceDiffusion is a text-to-image diffusion model that enables precise instance-level control over image generation. It allows users to specify the location and attributes of individual instances in the generated image using various location formats, including bounding boxes, instance masks, scribbles, and single points. The model supports flexible and diverse instance conditions, handling multiple instances with varying attributes and locations in a single image.

InstanceDiffusion outperforms existing state-of-the-art models in both image quality and instance attribute alignment. On the COCO dataset, it achieves a 20.4% improvement in AP50 for box inputs and a 25.4% improvement in IoU for mask inputs. Rather than training a separate pathway per location format, the model handles all formats in a unified way, leveraging their shared underlying structure to improve performance. It also introduces new evaluation metrics for point and scribble inputs.

InstanceDiffusion comprises three key components: UniFusion, which fuses instance-level conditions with the visual tokens of the diffusion backbone; ScaleU, which enhances image fidelity by recalibrating features; and Multi-instance Sampler, which reduces information leakage between instances during sampling. The model is trained on a dataset with instance-level captions generated using pretrained models.

InstanceDiffusion demonstrates a superior ability to adhere to instance-level text prompts, achieving significant improvements in color and texture accuracy. The model is flexible and can be applied to a variety of image generation tasks, including iterative generation, where new instances are added while existing ones are preserved. Overall, InstanceDiffusion provides a powerful framework for instance-level control in image generation.
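To make the unified handling of location formats more concrete, the sketch below shows one plausible way boxes and masks could be mapped to a common point-set parameterization and then combined with a per-instance text embedding into a single conditioning token, in the spirit of UniFusion. This is a minimal illustration under assumed design choices: the names (`InstanceConditionEncoder`, `box_to_points`, `mask_to_points`), the point parameterization, and the fusion layer are hypothetical and do not reproduce the paper's actual architecture.

```python
# Minimal sketch (assumptions, not the paper's implementation): unify boxes and
# masks into point sets, then fuse each instance's location with its text
# embedding into one conditioning token for later fusion with visual tokens.
import torch
import torch.nn as nn


def box_to_points(box):
    """Represent a normalized box (x1, y1, x2, y2) by its corners and center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return torch.tensor([[x1, y1], [x2, y1], [x2, y2], [x1, y2], [cx, cy]])


def mask_to_points(mask, n_points=16):
    """Sample up to n_points foreground coordinates from a binary mask."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    if len(xs) == 0:
        return torch.zeros(n_points, 2)
    idx = torch.randint(0, len(xs), (n_points,))
    return torch.stack([xs[idx].float(), ys[idx].float()], dim=-1)


class InstanceConditionEncoder(nn.Module):
    """Encode a point set plus a per-instance text embedding into one token."""

    def __init__(self, text_dim=768, hidden_dim=768, n_points=16):
        super().__init__()
        self.n_points = n_points
        self.point_mlp = nn.Sequential(
            nn.Linear(n_points * 2, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.fuse = nn.Linear(hidden_dim + text_dim, hidden_dim)

    def forward(self, points, text_emb):
        # points: (k, 2) normalized to [0, 1]; text_emb: (text_dim,)
        if points.shape[0] < self.n_points:  # pad short point sets
            pad = points[-1:].repeat(self.n_points - points.shape[0], 1)
            points = torch.cat([points, pad], dim=0)
        loc = self.point_mlp(points[: self.n_points].flatten())
        return self.fuse(torch.cat([loc, text_emb], dim=-1))


# Usage: encode one box-conditioned instance and one mask-conditioned instance.
encoder = InstanceConditionEncoder()
text_emb = torch.randn(768)  # stand-in for a per-instance text embedding
box_token = encoder(box_to_points((0.1, 0.2, 0.5, 0.8)), text_emb)
mask = torch.zeros(64, 64)
mask[10:30, 20:40] = 1
mask_token = encoder(mask_to_points(mask) / 64.0, text_emb)
instance_tokens = torch.stack([box_token, mask_token])  # fused with visual tokens downstream
```

The point of the sketch is the design choice it illustrates: once every location format is reduced to a shared parameterization, a single encoder and fusion pathway can serve boxes, masks, scribbles, and points alike, which is how a unified model can benefit from their common structure.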