Learning Continuous 3D Words for Text-to-Image Generation


13 Feb 2024 | Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix, Matthew Fisher, Radomír Měch, Andrew Markham, Niki Trigoni
This paper introduces Continuous 3D Words, a method for fine-grained control over 3D attributes in text-to-image generation. The approach lets users specify continuous attributes such as illumination, shape, orientation, and camera parameters through special tokens that are integrated directly into text prompts. These tokens, called Continuous 3D Words, are learned from a single 3D mesh and a rendering engine, enabling the model to generate images with precise control over these attributes without significant computational overhead.

The method learns a continuous vocabulary that maps attribute values to token embeddings and supports interpolation at inference time, so users can effectively create custom sliders for fine-grained control over image generation (a minimal code sketch of this mechanism is given below).

Training proceeds in two stages: first the object identity of the mesh is learned, then the attributes are disentangled from that identity. This prevents the model from encoding each attribute setting as a new object, which would hinder generalization to new objects (a training-loop sketch appears after the results below).

The paper also proposes ControlNet augmentations to improve the model's ability to handle complex image edits. These combine depth and lineart ControlNets, which help generate images with subtle changes that cannot be captured by depth maps alone (an illustrative usage sketch appears at the end of this summary).
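The core mechanism can be pictured as a small learned mapping from an attribute value to a token embedding that is spliced into the prompt embedding of a frozen text-to-image model. Below is a minimal PyTorch sketch of this idea; the class and function names, the MLP architecture, and the placeholder-index convention are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: map a continuous attribute value to a token embedding and
# splice it into the prompt embedding at a placeholder position.
import torch
import torch.nn as nn

class ContinuousWord(nn.Module):
    """Maps a scalar attribute (e.g. orientation normalized to [0, 1]) to a token embedding."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, attr_value: torch.Tensor) -> torch.Tensor:
        # attr_value: shape (batch, 1)
        return self.mlp(attr_value)

def inject_word(prompt_embeds: torch.Tensor,
                word_embed: torch.Tensor,
                placeholder_idx: int) -> torch.Tensor:
    """Replace the embedding at the placeholder-token position with the
    continuous 3D word. prompt_embeds: (batch, seq_len, embed_dim)."""
    prompt_embeds = prompt_embeds.clone()
    prompt_embeds[:, placeholder_idx, :] = word_embed
    return prompt_embeds

# A "slider" is simply a sweep over the attribute value at inference time:
word = ContinuousWord()
for v in torch.linspace(0.0, 1.0, steps=5):
    emb = word(v.view(1, 1))  # one embedding per slider position
    # prompt_embeds = inject_word(prompt_embeds, emb, placeholder_idx=5)
    # ... feed prompt_embeds to the frozen diffusion model ...
```

Because the mapping is continuous, any value between the rendered training samples yields a valid embedding, which is what allows smooth interpolation between attribute settings.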
The method is tested on various attributes, including illumination, wing pose, and dolly zoom, and shows significant improvements over existing baselines in both image quality and attribute control. The results demonstrate that Continuous 3D Words can effectively control multiple 3D attributes in text-to-image generation, even when the training data is limited to a single mesh. The method is lightweight and efficient enough to train on a single GPU. The paper also includes ablation and user studies that highlight the effectiveness of the approach in scenarios such as real-world image editing and multi-concept control. Overall, the method provides a flexible and powerful framework for generating images with precise control over 3D attributes.
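As noted above, training is split into two stages so that the learned attributes stay disentangled from the object identity. The sketch below shows one plausible way to structure this with a frozen Stable Diffusion UNet; the loss function follows diffusers conventions, but the helper names, dataloaders, and loop structure are hypothetical placeholders rather than the paper's released code.

```python
# Minimal sketch of a two-stage schedule, assuming a frozen Stable Diffusion
# UNet and a diffusers-style noise scheduler. encode_images, build_prompt_embeds,
# and the dataloaders are hypothetical placeholders.
import torch
import torch.nn.functional as F

def denoising_loss(unet, scheduler, latents, prompt_embeds):
    """Standard diffusion objective: predict the noise added to the latents."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=prompt_embeds).sample
    return F.mse_loss(pred, noise)

# Stage 1 -- learn the object identity of the training mesh:
#   only a learnable identity token embedding is optimized, so the model
#   first learns what the rendered object is.
#
# for renders, _ in stage1_loader:
#     loss = denoising_loss(unet, scheduler, encode_images(renders),
#                           build_prompt_embeds(identity_embed))
#     loss.backward(); identity_optimizer.step(); identity_optimizer.zero_grad()
#
# Stage 2 -- disentangle the attributes from that identity:
#   the identity embedding is frozen and only the attribute-to-embedding
#   mapping (the ContinuousWord module above) is trained on renders that
#   sweep the attribute, so each attribute value is not memorized as a
#   brand-new object.
#
# for renders, attr_values in stage2_loader:
#     word_embed = continuous_word(attr_values)
#     loss = denoising_loss(unet, scheduler, encode_images(renders),
#                           build_prompt_embeds(identity_embed.detach(), word_embed))
#     loss.backward(); attr_optimizer.step(); attr_optimizer.zero_grad()
```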
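Finally, the ControlNet augmentations mentioned above can be pictured with the standard diffusers ControlNet pipeline. The sketch below conditions generation on a depth map rendered from the training mesh; the model IDs and file paths are illustrative assumptions, and the paper additionally uses a lineart ControlNet for edits that depth alone cannot capture.

```python
# Illustrative sketch (not the paper's exact pipeline): conditioning Stable
# Diffusion on a rendered depth map with ControlNet via the diffusers library.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical path: a depth map rendered from the 3D mesh at the desired
# orientation/pose.
depth_map = load_image("renders/object_orientation_030_depth.png")

image = pipe(
    "a photo of the object on a beach at sunset",
    image=depth_map,              # spatial conditioning from the ControlNet
    num_inference_steps=30,
).images[0]
image.save("controlled_output.png")
```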