Faster Diffusion via Temporal Attention Decomposition

17 Jul 2024 | Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber
The paper explores the role of attention mechanisms in the inference process of text-conditional diffusion models, focusing on the temporal dynamics of cross-attention and self-attention. Empirical observations reveal that cross-attention outputs converge to a fixed point within the first few inference steps, dividing the denoising process into two phases: a semantic-planning phase and a fidelity-improving phase. Cross-attention is crucial in the initial phase for generating meaningful semantics, but becomes redundant in the later phase. In contrast, self-attention plays a minor role initially and becomes crucial in the second phase. These findings lead to Temporally Gating the Attention (T-GATE), a training-free method that caches and reuses attention outputs at scheduled time steps, significantly reducing computational cost without compromising image quality. Experimental results show that T-GATE can accelerate various text-conditional diffusion models by 10% to 50%, demonstrating its broad applicability and efficiency. The method applies to both U-Net and transformer-based architectures and is orthogonal to different noise schedulers and acceleration methods.
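To illustrate the caching-and-reuse idea described above, the following is a minimal sketch in PyTorch. The wrapper class, the `gate_step` value, and the toy attention layer are illustrative assumptions for exposition, not the authors' implementation: before the gate step, cross-attention is computed normally and its output is cached; afterwards, the cached output is replayed instead of recomputing attention.

```python
# Minimal sketch of the T-GATE caching idea (illustrative, not the authors' code).
# Assumes a PyTorch cross-attention module with forward(hidden_states, context).
import torch
import torch.nn as nn


class CachedCrossAttention(nn.Module):
    """Wraps a cross-attention layer; after `gate_step` denoising steps,
    reuses the cached output instead of recomputing attention."""

    def __init__(self, attn: nn.Module, gate_step: int):
        super().__init__()
        self.attn = attn
        self.gate_step = gate_step
        self.cache = None

    def forward(self, hidden_states, context, step: int):
        if step < self.gate_step or self.cache is None:
            # Semantic-planning phase: compute cross-attention as usual
            # and remember the most recent output.
            out = self.attn(hidden_states, context)
            self.cache = out.detach()
            return out
        # Fidelity-improving phase: skip the attention computation and
        # replay the converged output from the cache.
        return self.cache


if __name__ == "__main__":
    # Toy stand-in for a cross-attention layer: it ignores the text context
    # and just projects the latent features; for demonstration only.
    class ToyAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, hidden_states, context):
            return self.proj(hidden_states)

    layer = CachedCrossAttention(ToyAttention(64), gate_step=10)
    ctx = torch.randn(1, 77, 64)  # stand-in for text-encoder embeddings
    for step in range(25):
        latents = torch.randn(1, 77, 64)  # stand-in for evolving latents
        out = layer(latents, ctx, step)  # steps >= 10 reuse the cached output
```

In a real pipeline this wrapper would sit inside the denoising loop of a U-Net or transformer backbone; the savings come from skipping the attention computation (and the associated text-conditioning work) for every step in the fidelity-improving phase.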