11 Jun 2024 | Qihang Yu*, Mark Weber*, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
The paper introduces TiTok, a 1D image tokenizer that exploits redundancy across image regions to represent an image with as few as 32 tokens for both reconstruction and generation. Unlike conventional 2D tokenizers, which map an image onto a 2D grid of latents, TiTok tokenizes the image into a compact 1D latent sequence. The architecture is Transformer-based: a Vision Transformer (ViT) encoder, a ViT decoder, and a vector quantizer in between. Despite using far fewer tokens, TiTok matches or surpasses state-of-the-art 2D tokenizers in both reconstruction and generation. On the ImageNet 256 × 256 benchmark, TiTok-L-32 achieves a gFID of 2.21, outperforming the MaskGIT baseline by 4.21. On the ImageNet 512 × 512 benchmark, TiTok-L-64 outperforms DiT-XL/2 by 0.08 gFID while generating images 410× faster. The paper also analyzes how 1D tokenization captures high-level, semantically rich information, and reports on TiTok's training and inference efficiency.
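To make the architecture concrete, below is a minimal PyTorch sketch of how a TiTok-style 1D tokenizer can be wired together: patch embeddings and a small set of learnable latent tokens go through a ViT-style encoder, only the latent-token outputs are kept and vector-quantized, and a decoder reconstructs the image from mask tokens plus the quantized latents. All module names, sizes, and the simplified nearest-neighbor quantizer are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of a TiTok-style 1D tokenizer (not the official code).
import torch
import torch.nn as nn


class Simple1DTokenizer(nn.Module):
    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096, depth=4):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: image -> sequence of patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, dim))

        # Learnable 1D latent tokens (e.g. 32) appended to the patch sequence.
        self.latent_tokens = nn.Parameter(torch.randn(1, num_latent_tokens, dim))

        # ViT-style encoder / decoder built from standard Transformer blocks.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Vector-quantization codebook.
        self.codebook = nn.Embedding(codebook_size, dim)

        # Mask tokens stand in for the image patches on the decoder side.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch_size, stride=patch_size)
        self.num_latent_tokens = num_latent_tokens
        self.grid = image_size // patch_size

    def quantize(self, z):
        # Nearest-neighbor lookup in the codebook (straight-through gradient
        # and commitment losses omitted for brevity).
        B, K, D = z.shape
        dist = torch.cdist(z.reshape(B * K, D), self.codebook.weight)  # (B*K, codebook_size)
        ids = dist.argmin(dim=-1).view(B, K)                           # (B, K) discrete token ids
        return self.codebook(ids), ids

    def forward(self, images):
        B = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2) + self.patch_pos
        latents = self.latent_tokens.expand(B, -1, -1)

        # The encoder sees patches + latent tokens, but only the latent outputs
        # are kept, squeezing the whole image into a 1D sequence of K tokens.
        x = self.encoder(torch.cat([patches, latents], dim=1))
        z = x[:, -self.num_latent_tokens:]
        z_q, ids = self.quantize(z)

        # The decoder reconstructs patch features from mask tokens + quantized latents.
        masks = self.mask_token.expand(B, patches.shape[1], -1) + self.patch_pos
        y = self.decoder(torch.cat([masks, z_q], dim=1))[:, :patches.shape[1]]
        y = y.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        return self.to_pixels(y), ids


# Example: a single 256x256 image is reduced to 32 discrete token ids.
tok = Simple1DTokenizer()
recon, ids = tok(torch.randn(1, 3, 256, 256))
print(recon.shape, ids.shape)  # torch.Size([1, 3, 256, 256]) torch.Size([1, 32])
```

The key departure from 2D tokenizers is visible in the forward pass: the number of latent tokens is a free hyperparameter decoupled from the patch grid, which is what lets the sequence shrink to 32 tokens instead of the 256 or more produced by grid-based tokenizers.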