28 Oct 2024 | Sangwon Jang*, Jaehyeong Jo*, Kimin Lee†, Sung Ju Hwang†
The paper introduces MuDI, a novel framework for multi-subject personalization in text-to-image models, addressing the issue of identity mixing when generating images of multiple subjects. MuDI leverages segmented subjects obtained from a foundation model (Segment Anything Model, SAM) for both training and inference, effectively decoupling the identities of different subjects. The key contributions include:
1. **Seg-Mix Data Augmentation**: This method randomly composes segmented subjects during training, removing identity-irrelevant information and preventing identity mixing.
2. **Mean-Shifted Noise Initialization**: A novel inference method initializes the generation process with mean-shifted noise created from segmented subjects, enhancing identity separation and reducing subject dominance.
3. **Detect-and-Compare Metric**: A new metric, D&C, is introduced to evaluate the fidelity of multiple subjects in generated images, capturing the degree of identity mixing.
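The Seg-Mix idea (contribution 1) can be illustrated with a minimal sketch: subjects cut out by their SAM masks are pasted at random positions onto a blank canvas, so the model sees subjects together without their original backgrounds. Function and parameter names here are illustrative, not the paper's actual implementation.

```python
import numpy as np

def seg_mix(subject_images, subject_masks, canvas_size=(512, 512), rng=None):
    """Compose segmented subjects onto a blank canvas at random positions.

    Simplified Seg-Mix-style augmentation sketch: each subject is cut out
    by its segmentation mask (e.g., obtained from SAM) and pasted at a
    random location, discarding identity-irrelevant background context.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = canvas_size
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    for img, mask in zip(subject_images, subject_masks):
        h, w = mask.shape
        # Choose a random top-left corner so the subject fits on the canvas.
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        region = canvas[top:top + h, left:left + w]
        # Copy only masked (subject) pixels; later subjects may overlap earlier ones.
        region[mask] = img[mask]
    return canvas
```

In training, such composed images would replace or augment the original multi-subject training data, teaching the model that each identity is independent of its surroundings.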
Experimental results demonstrate that MuDI significantly outperforms existing methods in preventing identity mixing, as shown in qualitative and quantitative evaluations. Human evaluations also confirm the superior performance of MuDI, with raters preferring images generated by MuDI over other methods more than 70% of the time. The framework is model-agnostic and can be applied to various text-to-image models, including Stable Diffusion XL (SDXL).
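The mean-shifted noise initialization (contribution 2) can be sketched as follows: instead of starting generation from pure Gaussian noise, the initial latent is shifted toward a signal built from the segmented subjects, which helps anchor each subject's region. The `gamma` scaling parameter and function signature are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def mean_shifted_noise(segmented_latents, gamma=0.5, rng=None):
    """Initialize diffusion noise shifted toward segmented-subject latents.

    Sketch: draw standard Gaussian noise and add a scaled mean signal
    derived from the segmented subjects in latent space. `gamma`
    (illustrative name) controls how strongly the initialization is
    biased toward the subjects' layout.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(segmented_latents.shape)
    return noise + gamma * segmented_latents
```

Biasing the starting point this way gives each subject a spatial "seed" before denoising begins, which is what reduces one subject dominating or absorbing another.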