August 22, 2024 | Jerry Yao-Chieh Hu, Weimin Wu, Zhao Song, Han Liu
This paper investigates the statistical and computational limits of latent Diffusion Transformers (DiTs) under the assumption that the data lies on a low-dimensional linear subspace. Statistically, the study focuses on the universal approximation and sample complexity of the DiT score function, as well as the recovery of the initial data distribution. Under mild data assumptions, the paper derives an approximation error bound for the score network of latent DiTs that is sub-linear in the latent space dimension. It also shows that the data distribution generated from the estimated score function converges to a neighborhood of the original distribution. Computationally, the paper characterizes the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, it identifies efficiency criteria for all possible latent DiT inference algorithms and showcases the theory by pushing the efficiency toward almost-linear time inference. For backward computation, it leverages the low-rank structure within the gradient computation of DiT training for possible algorithmic speedup. Specifically, it shows that such a speedup achieves almost-linear time latent DiT training by casting the DiT gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of the initial data.
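As a rough, standalone illustration (not the authors' construction) of why chained low-rank structure enables near-linear computation, the Python sketch below applies a product of two low-rank n-by-n matrices to a vector using only their factors; the names n, r, U1, V1, U2, V2 are illustrative assumptions, not the paper's notation.

```python
import numpy as np

# Illustrative sketch only: the paper casts the DiT gradient as chained
# low-rank approximations with bounded error. Here we just show why chaining
# low-rank factors is cheap: if two n x n matrices have rank r, their product
# applied to a vector costs O(n * r) via the factors instead of O(n^2).
# All names (n, r, U1, V1, U2, V2) are ours, not the paper's notation.

rng = np.random.default_rng(0)
n, r = 1024, 8

U1, V1 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
U2, V2 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
x = rng.standard_normal(n)

# Dense path: materialize A = U1 V1^T and B = U2 V2^T -> O(n^2) time and memory.
A, B = U1 @ V1.T, U2 @ V2.T
dense = A @ (B @ x)

# Factored path: never form the n x n matrices -> O(n * r) per chained step.
factored = U1 @ (V1.T @ (U2 @ (V2.T @ x)))

print(np.allclose(dense, factored))  # -> True
```

The same bookkeeping extends to longer chains: each additional low-rank factor adds O(n * r) work, which is the intuition behind the almost-linear time training claim.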