31 Mar 2025 | Chenyang Zhu, Kai Li, Yue Ma, Chunming He, Xiu Li
This paper introduces MultiBooth, a method for generating customized images from text prompts that contain multiple concepts. MultiBooth addresses the challenges of multi-concept generation by dividing the process into two phases: single-concept learning and multi-concept integration. During the single-concept learning phase, a multi-modal image encoder and an efficient concept encoding technique are used to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, bounding boxes define the generation area for each concept within the cross-attention map, so that each concept is generated within its specified region. This approach improves concept fidelity and reduces inference cost. MultiBooth outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior performance and computational efficiency in multi-concept customization.
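The bounding-box guidance in the integration phase can be pictured as masking each concept's text tokens so that they are only visible to spatial positions inside that concept's box during cross-attention. The sketch below is a minimal illustration of this idea under stated assumptions, not the MultiBooth implementation: the function name `region_masked_cross_attention`, the tensor layout, and the exact masking scheme are hypothetical choices made for illustration.

```python
import torch

def region_masked_cross_attention(q, k, v, concept_token_ranges, concept_boxes, h, w):
    """Cross-attention in which each concept's text tokens attend only from
    spatial positions inside that concept's bounding box (illustrative sketch).

    q: (heads, h*w, d) image-feature queries
    k, v: (heads, n_text, d) text-embedding keys/values
    concept_token_ranges: list of (start, end) token index ranges, one per concept
    concept_boxes: list of (x0, y0, x1, y1) boxes in normalized [0, 1] coordinates
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hqd,hkd->hqk", q, k) * scale          # (heads, h*w, n_text)

    # Normalized (x, y) coordinate of every spatial position in the attention map.
    ys = torch.arange(h).repeat_interleave(w) / h                # (h*w,)
    xs = torch.arange(w).repeat(h) / w                           # (h*w,)

    # Block each concept's tokens at positions outside its box; the remaining
    # (non-concept) prompt tokens stay visible everywhere.
    mask = torch.zeros(h * w, k.shape[1], dtype=torch.bool)
    for (t0, t1), (x0, y0, x1, y1) in zip(concept_token_ranges, concept_boxes):
        inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
        mask[:, t0:t1] = ~inside.unsqueeze(1)

    logits = logits.masked_fill(mask.unsqueeze(0), float("-inf"))
    attn = logits.softmax(dim=-1)
    return torch.einsum("hqk,hkd->hqd", attn, v)                 # (heads, h*w, d)

# Toy usage: two concepts confined to the left and right halves of a 16x16 map.
heads, d, n_text, h, w = 8, 64, 12, 16, 16
q = torch.randn(heads, h * w, d)
k = torch.randn(heads, n_text, d)
v = torch.randn(heads, n_text, d)
out = region_masked_cross_attention(
    q, k, v,
    concept_token_ranges=[(4, 6), (8, 10)],
    concept_boxes=[(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],
    h=h, w=w,
)
print(out.shape)  # torch.Size([8, 256, 64])
```

In this simplified view, confining each concept's attention to its box is what keeps the concepts from bleeding into one another's regions, which is the intuition behind the reported gains in concept fidelity.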