Masked Completion via Structured Diffusion with White-Box Transformers

2024 | Druv Pai, Ziyang Wu, Sam Buchanan, Yaodong Yu, Yi Ma
This paper introduces CRATE-MAE, a white-box transformer-like architecture for unsupervised representation learning. The key idea is to connect diffusion, compression, and masked completion in order to design a deep transformer-like masked autoencoder. CRATE-MAE is mathematically interpretable: each layer explicitly transforms the data distribution to and from a structured representation.

The model is built by unrolling optimization steps that iteratively compress and sparsify the data, and the paper shows that these operations are mathematically equivalent to denoising. The encoder and decoder are designed to be distributionally invertible, enabling efficient autoencoding. The model is trained with a masked autoencoding task, in which it reconstructs images from masked inputs.

CRATE-MAE achieves promising performance on large-scale image datasets while using only about 30% of the parameters of standard masked autoencoders, and its learned representations have explicit structure and semantic meaning. Evaluated on tasks including image reconstruction and classification, it is competitive with larger models and outperforms traditional masked autoencoders in parameter efficiency and semantic representation quality. The results show that the white-box design allows a deeper understanding of the learned representations, and the paper highlights the potential of white-box networks for tasks beyond supervised classification by unifying concepts such as diffusion, compression, and autoencoding.
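To make the unrolled design concrete, below is a minimal PyTorch sketch of one such layer, alternating a compression step toward learned subspaces with an ISTA-style sparsification step, followed by a toy masked-reconstruction pass. This is not the authors' CRATE-MAE code: the module names (`CompressionStep`, `SparsificationStep`, `CrateStyleBlock`), dimensions, step sizes, and masking ratio are illustrative assumptions.

```python
# Minimal sketch, assuming simplified forms of the two unrolled operations
# described above (compression toward learned subspaces, then ISTA-style
# sparsification), plus a toy masked-reconstruction loss.
# NOT the authors' CRATE-MAE implementation; names and constants are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompressionStep(nn.Module):
    """Attention-like update that compresses tokens toward learned low-dim subspaces."""

    def __init__(self, dim: int, num_heads: int = 4, step_size: float = 0.5):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim, self.step_size = num_heads, dim // num_heads, step_size
        # One learned subspace basis per head (stand-in for the paper's operators).
        self.U = nn.Parameter(torch.randn(num_heads, dim, self.head_dim) * dim ** -0.5)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (batch, tokens, dim)
        updates = []
        for h in range(self.num_heads):
            proj = z @ self.U[h]                                        # project onto subspace h
            attn = F.softmax(proj @ proj.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            updates.append((attn @ proj) @ self.U[h].transpose(0, 1))   # lift back to model dim
        return z + self.step_size * sum(updates)                        # residual, gradient-style step


class SparsificationStep(nn.Module):
    """One ISTA-style proximal-gradient update toward a sparse, non-negative code."""

    def __init__(self, dim: int, step_size: float = 0.1, lam: float = 0.1):
        super().__init__()
        self.D = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)      # learned dictionary (illustrative)
        self.step_size, self.lam = step_size, lam

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        grad = (z @ self.D - z) @ self.D.transpose(0, 1)                # data-fit gradient term (illustrative)
        return F.relu(z - self.step_size * (grad + self.lam))           # soft-threshold / non-negativity


class CrateStyleBlock(nn.Module):
    """One unrolled layer: compress, then sparsify."""

    def __init__(self, dim: int):
        super().__init__()
        self.compress = CompressionStep(dim)
        self.sparsify = SparsificationStep(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.sparsify(self.compress(z))


if __name__ == "__main__":
    # Toy masked-autoencoding pass: mask most patch tokens, encode/decode with
    # stacked unrolled blocks, and measure reconstruction error on masked tokens.
    batch, num_tokens, dim = 2, 16, 64
    tokens = torch.randn(batch, num_tokens, dim)                        # synthetic patch embeddings
    mask = torch.rand(batch, num_tokens) < 0.75                         # mask roughly 75% of tokens
    mask_token = torch.zeros(dim)
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)

    encoder = nn.Sequential(*[CrateStyleBlock(dim) for _ in range(2)])
    decoder = nn.Sequential(*[CrateStyleBlock(dim) for _ in range(2)])
    recon = decoder(encoder(masked))

    loss = F.mse_loss(recon[mask], tokens[mask])                        # loss only on masked positions
    print("masked reconstruction loss:", loss.item())
```

The usage at the bottom mirrors the masked autoencoding setup described in the summary at a toy scale: only the masked positions contribute to the reconstruction loss, and the encoder and decoder are both stacks of the same unrolled compression-and-sparsification layer.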