10 Jul 2024 | Byeongjun Park, Hyojun Go, Jin-Young Kim, Sangmin Woo, Seokil Ham, and Changick Kim
Switch Diffusion Transformer (Switch-DiT) is a novel diffusion model architecture that synergizes multiple denoising tasks by leveraging sparse mixture-of-experts (SMoE) layers within each transformer block. The model aims to preserve semantic information shared across tasks while isolating parameters to handle conflicting denoising tasks. To this end, Switch-DiT introduces a diffusion prior loss that encourages similar denoising tasks to share denoising paths while isolating conflicting ones; this loss also stabilizes the convergence of the EMA model for the gating networks and helps the model synergize denoising tasks effectively. Each transformer block contains a timestep-based gating network that selects experts according to the characteristics of the denoising task, and the resulting SMoE layer constructs both common and task-specific denoising paths. Evaluated on unconditional and class-conditional image generation, Switch-DiT improves both image quality and convergence rate, outperforming existing methods in handling conflicting tasks and in constructing denoising paths tailored to various generation scenarios. Extensive experiments validate that the architecture is both efficient and effective.
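
To make the routing idea concrete, here is a minimal PyTorch sketch of a timestep-gated SMoE layer in the spirit of the description above. The class name `TimestepGatedSMoE`, the expert count, and the top-k routing scheme are illustrative assumptions, not the authors' implementation; the key point is that the gate is conditioned on the timestep embedding rather than on individual tokens, so every token at a given timestep follows the same denoising path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestepGatedSMoE(nn.Module):
    """Sparse MoE feed-forward layer whose routing is conditioned on the
    diffusion timestep embedding, so that all tokens at a given timestep
    share the same subset of experts (a hypothetical sketch)."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # Timestep-based gating network: maps the timestep embedding to
        # expert logits shared by every token in the sample.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; t_emb: (B, dim) timestep embedding.
        weights = F.softmax(self.gate(t_emb), dim=-1)        # (B, E)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize top-k
        out = torch.zeros_like(x)
        for b in range(x.size(0)):                           # per-sample routing
            for w, idx in zip(topk_w[b], topk_idx[b]):
                out[b] = out[b] + w * self.experts[int(idx)](x[b])
        return out
```

In a transformer block, such a layer would stand in for the usual pointwise feed-forward network. The summary mentions both common and task-specific denoising paths; one plausible way to realize a common path is to always activate one shared expert alongside the routed ones, but that detail is an assumption here rather than something stated in the summary.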
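The summary does not give the exact form of the diffusion prior loss, so the sketch below only illustrates its stated intent: gating distributions of nearby timesteps are pulled toward each other, while those of distant (assumed more conflicting) timesteps are pushed apart. The exponential similarity prior and the `tau` scale are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_prior_loss(gate_logits: torch.Tensor,
                         timesteps: torch.Tensor,
                         tau: float = 100.0) -> torch.Tensor:
    """Encourage similar denoising tasks to share experts and conflicting
    ones to use disjoint experts (a hypothetical sketch).

    gate_logits: (B, num_experts) logits from the gating network.
    timesteps:   (B,) integer diffusion timesteps for each sample.
    """
    probs = F.softmax(gate_logits, dim=-1)                 # (B, E) routing dists
    sim = probs @ probs.t()                                # pairwise routing overlap
    dt = (timesteps[:, None] - timesteps[None, :]).abs().float()
    prior = torch.exp(-dt / tau)                           # assumed similarity prior
    return F.mse_loss(sim, prior)
```

In training, a term like this would be added to the standard denoising objective with a weighting coefficient; how the paper's actual loss is formulated and how it stabilizes the EMA gating model may differ from this sketch.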