SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

25 Mar 2024 | Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen
The paper introduces SD-DiT, a Diffusion Transformer (DiT) architecture that leverages self-supervised discrimination to improve training efficiency and generative capacity. DiT models, a recent advance in generative diffusion, suffer from slow convergence and high computational cost. SD-DiT addresses these issues by decoupling the encoder and decoder so that the discriminative and generative objectives can be optimized separately. Specifically, a teacher-student setup builds discriminative pairs from diffusion noises sampled along the Probability Flow Ordinary Differential Equation (PF-ODE). The discriminative loss aligns visible tokens between the teacher and student encoders in a joint embedding space, while the generative loss is optimized by the student decoder alone. This separation of objectives improves both training efficiency and generative performance. Extensive experiments on ImageNet show that SD-DiT achieves a competitive balance between training speed and generative quality, outperforming state-of-the-art DiT models.
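
To make the decoupled training concrete, below is a minimal PyTorch sketch of the loop described above, under stated assumptions: stand-in transformer modules, a cosine-similarity discriminative loss, random latents in place of real image tokens, and an EMA teacher update. Token masking (the "visible tokens" mechanism) is omitted for brevity, and the module names, loss weight, and EMA rate are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SD-DiT-style decoupled training: a student
# encoder-decoder carries the generative (denoising) loss, while a
# frozen EMA teacher encoder provides targets for a discriminative loss
# computed between two noise levels on the same PF-ODE trajectory.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in transformer encoder mapping noisy tokens to features."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, dim)  # head into the joint embedding space

    def forward(self, tokens):
        feats = self.blocks(tokens)
        return feats, self.proj(feats)

class Decoder(nn.Module):
    """Stand-in decoder predicting the denoising target from encoder features."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, feats):
        return self.out(self.blocks(feats))

student_enc, student_dec = Encoder(), Decoder()
teacher_enc = copy.deepcopy(student_enc)   # teacher is an EMA copy of the student
for p in teacher_enc.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(
    list(student_enc.parameters()) + list(student_dec.parameters()), lr=1e-4)

def add_noise(x0, sigma):
    # Perturb clean tokens with Gaussian noise of scale sigma, i.e. a
    # sample at noise level sigma along the PF-ODE trajectory.
    return x0 + sigma * torch.randn_like(x0)

for step in range(1000):
    x0 = torch.randn(8, 64, 256)           # (batch, tokens, dim) placeholder latents
    # Two noise levels on the same trajectory form a discriminative pair;
    # giving the teacher the less-noisy view is an assumed convention here.
    sigma_s = torch.rand(1) * 1.0 + 0.5
    sigma_t = torch.rand(1) * 0.5
    x_student, x_teacher = add_noise(x0, sigma_s), add_noise(x0, sigma_t)

    feats, z_student = student_enc(x_student)
    with torch.no_grad():
        _, z_teacher = teacher_enc(x_teacher)

    # Discriminative loss: align student and teacher tokens in the joint
    # embedding space (cosine alignment used as a stand-in objective).
    loss_disc = 1 - F.cosine_similarity(z_student, z_teacher, dim=-1).mean()
    # Generative loss: the student decoder alone carries the denoising objective.
    loss_gen = F.mse_loss(student_dec(feats), x0)
    loss = loss_gen + 0.1 * loss_disc       # loss weighting is an assumption

    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update keeps the teacher a slow-moving average of the student encoder.
    with torch.no_grad():
        for pt, ps in zip(teacher_enc.parameters(), student_enc.parameters()):
            pt.mul_(0.996).add_(ps, alpha=0.004)
```

The key design point the sketch illustrates is that gradients from the discriminative loss flow only through the student encoder, while the decoder is trained purely on the generative objective, matching the paper's separation of the two tasks.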