MS-Diffusion is a framework for zero-shot, multi-subject image personalization with layout guidance. It addresses two key challenges: maintaining detailed subject fidelity and achieving a cohesive representation when multiple subjects appear in one image. The framework integrates grounding tokens with a feature resampler to preserve detail fidelity across subjects, and employs a multi-subject cross-attention mechanism that adapts to multi-subject inputs by restricting each subject condition to its designated area. This design strengthens control over the image's multi-subject composition while preserving text control. Comprehensive experiments show that MS-Diffusion outperforms existing models in both image and text fidelity, marking a notable advance in personalized text-to-image generation. The method is evaluated on benchmarks including DreamBench and MS-Bench and demonstrates superior performance in both single-subject and multi-subject personalization tasks.
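
The layout-guided cross-attention idea can be illustrated with a minimal PyTorch sketch. This is not the released MS-Diffusion implementation; the module name `MultiSubjectCrossAttention` and arguments such as `subject_tokens` and `boxes` are illustrative assumptions. The sketch only shows the core mechanism: each query location in the U-Net feature map may attend only to the image tokens of the subject whose bounding box covers it.

```python
# Minimal sketch (assumptions, not the authors' code) of layout-masked
# multi-subject cross-attention: subject image tokens are visible only
# to query positions inside that subject's bounding box.
import torch
from torch import nn


class MultiSubjectCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latents, subject_tokens, boxes, h, w):
        # latents:        (B, h*w, dim)  spatial U-Net features
        # subject_tokens: (B, S, T, dim) resampled image tokens per subject
        # boxes:          (B, S, 4)      normalized (x0, y0, x1, y1) layouts
        B, S, T, dim = subject_tokens.shape
        q = self.to_q(latents)
        k = self.to_k(subject_tokens.reshape(B, S * T, dim))
        v = self.to_v(subject_tokens.reshape(B, S * T, dim))

        # Build a boolean mask: query positions outside a subject's box
        # cannot attend to that subject's tokens.
        ys = torch.linspace(0, 1, h, device=latents.device)
        xs = torch.linspace(0, 1, w, device=latents.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        gx = gx.reshape(1, 1, h * w)
        gy = gy.reshape(1, 1, h * w)
        x0, y0, x1, y1 = boxes.unbind(-1)                  # each (B, S)
        inside = (
            (gx >= x0[..., None]) & (gx <= x1[..., None]) &
            (gy >= y0[..., None]) & (gy <= y1[..., None])
        )                                                   # (B, S, h*w)
        attn_mask = inside.permute(0, 2, 1)                 # (B, h*w, S)
        attn_mask = attn_mask.repeat_interleave(T, dim=-1)  # (B, h*w, S*T)

        def split_heads(x):
            return x.view(B, -1, self.num_heads, dim // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))               # (B, H, N, d)
        scores = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, h*w, S*T)
        scores = scores.masked_fill(~attn_mask[:, None], torch.finfo(scores.dtype).min)
        # Zero out masked keys after softmax so queries covered by no box
        # receive no subject contribution at all.
        attn = scores.softmax(dim=-1) * attn_mask[:, None].to(scores.dtype)
        out = attn @ v                                        # (B, H, h*w, d)
        out = out.transpose(1, 2).reshape(B, h * w, dim)
        return self.to_out(out)
```

One design point worth noting in this sketch: query positions that fall outside every subject's box produce a zero output from the subject branch, so those regions remain governed by the ordinary text cross-attention, which is consistent with the stated goal of preserving text control over the overall composition.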