25 Mar 2024 | Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen
SD-DiT is a novel Diffusion Transformer that leverages self-supervised discrimination to improve both the training efficiency and the generative capacity of diffusion models. It addresses two key limitations of existing mask-based diffusion models: the training-inference discrepancy introduced by masking, and the fuzzy relation between mask reconstruction and the generative diffusion process.

SD-DiT adopts a teacher-student framework in which the teacher and student encoders receive the same image perturbed with different diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). A discriminative loss encourages inter-image alignment in the self-supervised embedding space, while a generative loss drives the standard diffusion objective. By decoupling the encoder and decoder to handle the discriminative and generative objectives separately, SD-DiT trains more efficiently.

Extensive experiments on ImageNet show that SD-DiT strikes a competitive balance between training cost and generative capacity, converging faster and performing better than state-of-the-art diffusion models. The key contributions are a new diffusion transformer structure that fully exploits self-supervised discrimination to facilitate DiT training, and an elegant design that bridges the training-inference discrepancy in a way tailored to generative tasks.
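To make the teacher-student design concrete, below is a minimal PyTorch sketch of the joint objective described above. Everything here is an illustrative assumption rather than the paper's implementation: `TinyEncoder`, `SDDiTSketch`, and `disc_head` are hypothetical stand-ins for the actual DiT encoder/decoder, and a mean-pooled cosine alignment stands in for the paper's self-supervised discriminative objective. The one structural point the sketch does capture is that the two encoders see the same clean latent perturbed with two different noise levels along the same PF-ODE trajectory, and that the generative and discriminative losses are computed on decoupled heads.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Toy stand-in for the DiT encoder, conditioned on the noise level sigma."""

    def __init__(self, dim=64):
        super().__init__()
        self.sigma_embed = nn.Linear(1, dim)
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, sigma):
        # x: (B, N, dim) latent tokens; sigma: (B,) per-sample noise levels
        return self.body(x + self.sigma_embed(sigma.view(-1, 1, 1)))


class SDDiTSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.student = TinyEncoder(dim)
        self.teacher = copy.deepcopy(self.student)  # EMA copy, never backpropagated
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.decoder = nn.Linear(dim, dim)     # generative (denoising) branch
        self.disc_head = nn.Linear(dim, dim)   # discriminative projection branch

    def training_step(self, x0, sigma_s, sigma_t):
        # Student and teacher see the SAME clean latent x0 perturbed with two
        # different noise levels along the same PF-ODE trajectory; the student
        # gets the heavier perturbation, the teacher the lighter one.
        xs = x0 + sigma_s.view(-1, 1, 1) * torch.randn_like(x0)
        xt = x0 + sigma_t.view(-1, 1, 1) * torch.randn_like(x0)

        hs = self.student(xs, sigma_s)
        with torch.no_grad():
            ht = self.teacher(xt, sigma_t)

        # Generative loss: the decoder reconstructs the clean latent
        # (a plain denoising objective standing in for the diffusion loss).
        loss_gen = F.mse_loss(self.decoder(hs), x0)

        # Discriminative loss: pull student/teacher embeddings of the same
        # image together in the self-supervised embedding space.
        zs = F.normalize(self.disc_head(hs).mean(dim=1), dim=-1)
        zt = F.normalize(self.disc_head(ht).mean(dim=1), dim=-1)
        loss_disc = (1.0 - (zs * zt).sum(dim=-1)).mean()

        return loss_gen + loss_disc

    @torch.no_grad()
    def update_teacher(self, momentum=0.996):
        # Standard EMA teacher update used in self-distillation frameworks.
        for p_t, p_s in zip(self.teacher.parameters(), self.student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


# Toy usage: sigma_s > sigma_t, so the student solves the harder denoising view.
model = SDDiTSketch()
x0 = torch.randn(4, 16, 64)  # (batch, tokens, dim) latent tokens
loss = model.training_step(
    x0,
    sigma_s=torch.full((4,), 0.8),
    sigma_t=torch.full((4,), 0.1),
)
loss.backward()
model.update_teacher()
```

Keeping the teacher gradient-free and updating it only by EMA is what lets the discriminative loss supply a stable alignment target without doubling the backward cost, while the separate `decoder` head keeps the generative objective decoupled from the discriminative one, mirroring the encoder-decoder split described above.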