2025 | Chenyang Zhu, Kai Li, Yue Ma, Chunming He, Xiu Li
MultiBooth is a method for generating images from text prompts that contain multiple customized concepts. It addresses the challenges of multi-concept generation by dividing the process into two phases: single-concept learning and multi-concept integration. In the single-concept learning phase, a multi-modal image encoder and an efficient concept encoding technique learn a concise, discriminative representation of each concept, and adaptive concept normalization mitigates the domain gap between these learned embeddings and ordinary text embeddings. In the multi-concept integration phase, a regional customization module uses bounding boxes to define the generation area of each concept within the cross-attention map, so that separately customized single-concept modules can be combined plug-and-play into a single multi-concept image.

This design improves concept fidelity while keeping inference cost low: single-concept modules are trained independently, so new concepts can be added without joint retraining, and the regional customization module adds minimal overhead at inference time. The approach is validated on a variety of subjects, including pets, objects, and scenes, showing strong concept fidelity and prompt alignment, and it handles complex object interactions while maintaining high image fidelity. In both qualitative and quantitative comparisons with existing methods, MultiBooth demonstrates superior image quality, faithfulness to the customized concepts, and alignment with the text prompt, along with better computational efficiency.
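The summary does not spell out the "efficient concept encoding technique"; one common way to realize it is low-rank adaptation (LoRA), where each concept is stored as a small trainable update to frozen attention projections. The sketch below is an assumption, not the authors' implementation; `LoRALinear`, the rank, and the targeted layers are all hypothetical choices.

```python
# Hedged sketch: LoRA-style concept encoding (an assumption, not confirmed by
# the summary). Each concept trains only the small `down`/`up` matrices while
# the pretrained weights stay frozen, keeping concepts cheap to store and swap.
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep pretrained weights fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because only `down` and `up` are trained per concept, each customized module is tiny and can be loaded or unloaded independently, which is what would make the later plug-and-play integration cheap.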
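Adaptive concept normalization is described only as mitigating the domain gap in the embedding space. Below is a minimal sketch of one such normalization, rescaling the image-derived concept embedding so its L2 norm matches the average norm of the surrounding prompt-token embeddings; the function name, tensor shapes, and the choice of mean-norm statistic are assumptions, not the paper's API.

```python
# Hedged sketch of adaptive concept normalization: image-encoder embeddings
# tend to have a different norm than text-token embeddings, so the concept
# embedding is rescaled to the typical token norm before being injected into
# the prompt. Shapes and the exact statistic are assumptions.
import torch

def adaptive_concept_normalization(concept_embed: torch.Tensor,
                                   prompt_embeds: torch.Tensor) -> torch.Tensor:
    """concept_embed: (d,); prompt_embeds: (n, d) surrounding token embeddings."""
    target_norm = prompt_embeds.norm(dim=-1).mean()   # typical text-token norm
    return concept_embed * (target_norm / concept_embed.norm())
```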
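The regional customization module is the mechanism the summary describes most concretely: bounding boxes define each concept's generation area within the cross-attention map. Below is a minimal sketch of box-masked cross-attention at one attention layer; all names and the exact masking scheme (restricting each concept's attention output to its box) are illustrative, not the paper's code.

```python
# Hedged sketch: region-masked cross-attention. Each customized concept gets
# its own text keys/values, and its attention output is written only inside
# its bounding box, so concepts do not bleed into each other's regions.
import torch

def box_to_mask(box, h, w, device):
    """Rasterize a fractional (x0, y0, x1, y1) box into a binary (h, w) mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w, device=device)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def region_masked_cross_attention(q, kv_per_concept, boxes, h, w):
    """
    q:              (h*w, d) image-token queries at one attention layer
    kv_per_concept: list of (k, v) pairs, one per concept, each (n_text, d)
    boxes:          fractional bounding boxes, one per concept
    """
    d = q.shape[-1]
    out = torch.zeros_like(q)
    for (k, v), box in zip(kv_per_concept, boxes):
        attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)          # (h*w, n_text)
        region = box_to_mask(box, h, w, q.device).reshape(-1, 1)  # (h*w, 1)
        out = out + region * (attn @ v)   # concept contributes only in its box
    return out
```

In a full diffusion pipeline, a call like this would stand in for the standard cross-attention during sampling, with one (k, v) pair derived from each concept's separately customized prompt; this is what allows independently trained single-concept modules to be composed at inference with minimal extra cost.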