An Image is Worth 32 Tokens for Reconstruction and Generation


11 Jun 2024 | Qihang Yu*, Mark Weber*, Xueqing Deng, Xiaohui Shen, Daniel Cremers*, Liang-Chieh Chen
This paper introduces TiTok, a compact 1D tokenizer that represents an image with only 32 tokens for both reconstruction and generation. Unlike traditional 2D tokenization methods, TiTok exploits the redundancy among image regions to learn a more compact and efficient latent representation: a 256x256x3 image is reduced to just 32 discrete tokens, far fewer than the 256 or 1024 tokens used by prior methods. Despite this compactness, TiTok remains competitive with state-of-the-art approaches in both generation quality and speed. Using the same generator framework, TiTok reaches a gFID of 1.97 on the ImageNet 256x256 benchmark, outperforming the MaskGIT baseline by 4.21. On the ImageNet 512x512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also uses 64× fewer image tokens, leading to a 410× faster generation process. The best-performing TiTok variant surpasses DiT-XL/2 by a wider margin (gFID 2.13 vs. 3.04) while generating high-quality samples 74× faster.

TiTok is a transformer-based framework that tokenizes images into 1D latent sequences. It consists of a Vision Transformer (ViT) encoder, a ViT decoder, and a vector quantizer. The image is split into patches, which are concatenated with a 1D sequence of latent tokens; after encoding, these latent tokens form the latent representation of the image. Vector quantization is then applied, and the ViT decoder reconstructs the input image from the quantized latent tokens together with mask tokens standing in for the image patches.

The paper also studies the dynamics of 1D image tokenization: increasing the number of latent tokens consistently improves reconstruction quality, but the benefit becomes marginal beyond 128 tokens, and 32 tokens already suffice for reasonable image reconstruction.
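The tokenization pipeline above (patchify, concatenate latent tokens, encode, quantize) can be sketched in a few lines of NumPy. This is a shape-level illustration only: the embedding width, codebook size, and the random stand-in for the ViT encoder are hypothetical toy values, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches   = (256 // 16) ** 2   # 256 patch tokens from a 256x256 image (16x16 patches)
num_latents   = 32                 # TiTok's compact 1D latent sequence
dim           = 64                 # toy embedding width (real models are much larger)
codebook_size = 4096               # toy codebook size (assumed for illustration)

# Patch embeddings concatenated with learnable latent tokens form the encoder input.
patches = rng.normal(size=(num_patches, dim))
latents = rng.normal(size=(num_latents, dim))
encoder_input = np.concatenate([patches, latents], axis=0)   # shape (288, dim)

# Stand-in for the ViT encoder: after encoding, only the latent positions are kept.
encoded_latents = encoder_input[num_patches:]                # shape (32, dim)

# Vector quantization: each latent token snaps to its nearest codebook entry.
codebook = rng.normal(size=(codebook_size, dim))
dists = ((encoded_latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
token_ids = dists.argmin(axis=1)   # the 32 discrete tokens representing the image

print(encoder_input.shape, token_ids.shape)
```

The decoder then runs the reverse path: the 32 quantized latent embeddings are concatenated with mask tokens at the patch positions, and the ViT decoder regresses the image from them.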
Scaling up the tokenizer model size significantly improves performance, especially when the number of tokens is limited. 1D tokenization breaks the grid constraints of prior 2D image tokenizers, enabling more flexible tokenizer designs and allowing the model to learn higher-level, more semantically rich image representations. It also performs well in generative training, delivering significant speed-ups for both training and inference and competitive FID scores compared to typical 2D tokenizers while using far fewer tokens. Finally, the paper presents a two-stage training strategy for TiTok, consisting of a "warm-up" stage and a "decoder fine-tuning" stage: the warm-up stage trains the 1D VQ model against proxy codes produced by an off-the-shelf pretrained tokenizer, and the decoder fine-tuning stage then refines only the decoder for pixel-level reconstruction, which stabilizes training and improves quality.
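As a quick arithmetic check on the headline numbers, the MaskGIT baseline implied by the reported gFID and improvement can be recovered in plain Python (using only the figures quoted in the summary):

```python
# Numbers quoted in the summary for the ImageNet 256x256 benchmark.
titok_gfid = 1.97     # TiTok using the same generator framework as MaskGIT
improvement = 4.21    # reported gFID improvement over the MaskGIT baseline

# Lower gFID is better, so the implied baseline sits above TiTok's score
# by exactly the reported margin.
maskgit_gfid = round(titok_gfid + improvement, 2)
print(maskgit_gfid)  # 6.18
```

The same pattern holds at 512x512, where the gap to DiT-XL/2 (3.04 vs. TiTok's 2.74, or 2.13 for the best variant) comes with large reported speed-ups (410× and 74×, respectively).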