21 Jan 2024 | Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I-Chao Chang, Hanwang Zhang
This paper explores the potential of diffusion time-steps as an inductive bias for unsupervised representation learning. The authors propose DiTi, a method that leverages the connection between diffusion time-steps and hidden modular attributes to learn a disentangled representation. The key idea is that as the diffusion time-step increases, progressively more attributes of the image are lost to noise; the learning task is therefore to learn features that compensate for these lost attributes.

Concretely, DiTi trains a feature encoder to capture the cumulative set of attributes lost up to each time-step, enabling accurate attribute classification and faithful counterfactual generation. The method is implemented with a pre-trained (frozen) diffusion model and a trainable encoder-decoder architecture, where the encoder learns time-step-specific features that compensate for the diffusion model's reconstruction error at each step. The paper also develops the theoretical foundations of the approach, including the relationship between attribute loss and diffusion time-steps.

The approach is validated on the CelebA, FFHQ, and Bedroom datasets, showing significant improvements in attribute inference accuracy and counterfactual generation quality, with ablation studies evaluating the effectiveness of the different design choices. The results demonstrate that DiTi effectively captures the modular structure of hidden attributes, which supports building more robust and fair models, and that it outperforms existing methods on both inference and generation tasks.
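The core mechanism, assigning disjoint feature blocks to time-step ranges so that each block compensates for the attributes lost in its own range, can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the function name, the equal-block partitioning, and the defaults (`T=1000`, `d=512`, `k=64`) are assumptions.

```python
import torch

def timestep_mask(t: int, T: int = 1000, d: int = 512, k: int = 64) -> torch.Tensor:
    """Keep only the feature blocks assigned to time-steps up to t.

    Sketch of the DiTi-style partitioning (names/defaults are assumptions):
    the d-dim feature is split into k equal blocks, with block i covering
    time-steps [i*T/k, (i+1)*T/k). Masking out later blocks means the
    conditioned feature z * mask can only compensate for attributes lost
    by step t, pushing each block to encode the attributes lost in its
    own time-step range.
    """
    block = d // k                       # feature dims per block
    n_active = min(k, t * k // T + 1)    # blocks whose range starts at or before t
    mask = torch.zeros(d)
    mask[: n_active * block] = 1.0
    return mask

# In training, the masked feature would condition the frozen denoiser, e.g.:
#   z = encoder(x0)
#   loss = mse(denoiser(x_t, t, z * timestep_mask(t)), noise)
```

At `t = 0` only the first block is active, and by the final step all `k` blocks are, so the encoder's capacity is spread across the whole diffusion trajectory rather than collapsing into a single entangled code.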