RealCustom is a novel text-to-image customization paradigm that disentangles similarity from controllability by precisely limiting the influence of the given subject to the relevant parts of the generated image. Unlike existing methods that represent subjects as pseudo-words, RealCustom gradually narrows a real text word from its general connotation down to the specific given subject, using cross-attention to distinguish relevant from irrelevant regions. This enables both high subject similarity and strong text controllability in real-time open-domain scenarios. The core idea is to iteratively update the influence scope and the influence quantity of the given subject during generation, so that the output closely matches the subject within that scope while the text retains control over the rest of the image.
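To make the iterative idea concrete, below is a toy, self-contained sketch in PyTorch. Every name here is an assumption for illustration, not RealCustom's actual API: `fake_unet` merely stands in for a real diffusion denoiser, and the update rule is a placeholder.

```python
import torch

def fake_unet(latents, t, text_emb, subject_emb=None, mask=None):
    """Stand-in denoiser: fabricates a noise prediction plus a
    cross-attention map for the target real word (hypothetical)."""
    noise = torch.randn_like(latents)
    attn = torch.rand(latents.shape[-2:])   # (H, W) relevance to the word
    return noise, attn

def narrow_scope(attn, top_k_ratio=0.25):
    """Keep the Top-K most word-relevant locations as the influence scope."""
    k = max(1, int(top_k_ratio * attn.numel()))
    thresh = attn.flatten().topk(k).values.min()
    return (attn >= thresh).float()

def customize(latents, text_emb, subject_emb, steps=50):
    mask = torch.ones(latents.shape[-2:])    # start: subject may act anywhere
    for t in reversed(range(steps)):
        # T2I branch: a text-only step yields the word's cross-attention.
        eps_text, attn = fake_unet(latents, t, text_emb)
        mask = narrow_scope(attn)            # scope shrinks to relevant parts
        # TI2I branch: subject features are injected only inside the scope.
        eps_full, _ = fake_unet(latents, t, text_emb, subject_emb, mask)
        latents = latents - 0.1 * eps_full   # placeholder update rule
    return latents
```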
RealCustom introduces a "train-inference" decoupled framework. During training, it learns a generalized alignment between visual conditions and the original text conditions via an adaptive scoring module. During inference, an adaptive mask guidance strategy iteratively narrows the influence of the given subject, gradually focusing the generation on it. This strategy runs two branches: a text-to-image (T2I) branch and a text&image-to-image (TI2I) branch. The T2I branch estimates the influence scope by aggregating the cross-attention maps of the target real word, while the TI2I branch injects the influence quantity of the subject within that scope.
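The following is a minimal sketch of the two branches' roles in a single step, assuming PyTorch. The tensor shapes and names (`attn_maps`, `target_idx`, the layer/head counts) are assumptions chosen for illustration, not the paper's actual interfaces.

```python
import torch

def influence_scope(attn_maps, target_idx, top_k_ratio=0.25):
    """T2I branch: aggregate per-layer, per-head cross-attention for the
    target real word into a binary influence-scope mask.

    attn_maps:  (layers, heads, H*W, text_len) attention at one resolution.
    target_idx: position of the target real word in the prompt.
    """
    word_attn = attn_maps[..., target_idx]     # (layers, heads, H*W)
    agg = word_attn.mean(dim=(0, 1))           # average over layers and heads
    k = max(1, int(top_k_ratio * agg.numel()))
    thresh = agg.topk(k).values.min()
    return (agg >= thresh).float()             # (H*W,) binary mask

def inject_quantity(text_feat, subject_feat, mask):
    """TI2I branch: add the subject's influence only inside the scope,
    leaving the rest of the image under pure text control."""
    return text_feat + mask.unsqueeze(-1) * subject_feat

# Usage on dummy tensors: 16 layers, 8 heads, a 64x64 latent, 77 text tokens.
attn = torch.rand(16, 8, 64 * 64, 77)
mask = influence_scope(attn, target_idx=5)
feat = inject_quantity(torch.randn(64 * 64, 320),
                       torch.randn(64 * 64, 320), mask)
```

Top-K selection is one plausible way to binarize the aggregated relevance; the key design point is that the subject's features never touch locations the target word does not attend to.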
Comprehensive experiments demonstrate RealCustom's superior real-time customization ability, achieving both unprecedented subject similarity and text controllability. RealCustom outperforms existing methods on all metrics, including controllability and similarity, and exhibits excellent open-domain customization: it generates high-quality images that closely match the given subject while the text keeps control over the rest of the image. Because the method is trained on general text-image datasets rather than per-subject data, it can be applied to any category using any target real word. The project page is available at https://corleonehuang.github.io/realcustom/.