11 Jun 2024 | Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
This paper introduces a transformer-based image and video tokenizer built on Binary Spherical Quantization (BSQ). BSQ projects high-dimensional visual embeddings onto a lower-dimensional hypersphere and then applies binary quantization. The resulting tokenizer, BSQ-ViT, achieves state-of-the-art visual reconstruction quality with 2.4× higher throughput than prior methods. It uses a transformer encoder and decoder with block-wise causal masking to support variable-length videos. BSQ is parameter-efficient, scalable, and compact, compressing visual data by up to 100× with minimal distortion. It also enables masked language models to reach image synthesis quality competitive with GAN- and diffusion-based methods. On video compression, BSQ-ViT achieves results comparable to state-of-the-art standards such as H.264 and HEVC. The tokenizer is trained end-to-end within the VQ-GAN framework and integrates seamlessly with existing models. Because BSQ's quantization error is bounded, training converges faster and to better results than with other quantization methods. Ablation studies show that BSQ outperforms VQ in both reconstruction quality and computational efficiency. Overall, the results show that BSQ-ViT delivers high-quality image and video reconstruction, efficient compression, and competitive image generation. The code and models are available at https://github.com/zhaoyue-zephyrus/bsq-vit.
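
The quantization step summarized above (project to a lower dimension, normalize onto the unit hypersphere, then binarize) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bsq_quantize`, the random projection matrix, and the dimensions `d = 16` and `L = 8` are assumptions chosen for illustration, and training details such as gradient handling are omitted. The sketch also checks one concrete form of the bounded-error property: for a unit vector `u` in `L` dimensions and its binarized counterpart `u_hat`, the squared distance is at most `2 - 2/sqrt(L)`, because the inner product of `u` and `u_hat` is at least `1/sqrt(L)`.

```python
import math
import random


def bsq_quantize(z, proj):
    """Illustrative binary spherical quantization (hypothetical helper).

    z    : input embedding, a list of d floats
    proj : projection matrix as L rows of d floats (assumed, e.g. learned)
    Returns (u, u_hat, token): the unit vector, its quantized version,
    and the integer token formed from the L sign bits.
    """
    # Linear projection from d dimensions down to L dimensions.
    v = [sum(w * x for w, x in zip(row, z)) for row in proj]

    # L2-normalize so the point lies on the unit hypersphere.
    # (A zero vector is left as zeros; this sketch does not handle it specially.)
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    u = [c / norm for c in v]

    # Binary quantization: each coordinate snaps to +1/sqrt(L) or -1/sqrt(L),
    # so the quantized vector also has unit norm.
    L = len(u)
    scale = 1.0 / math.sqrt(L)
    bits = [1 if c >= 0 else 0 for c in u]
    u_hat = [scale if b else -scale for b in bits]

    # Pack the L sign bits into a single integer token in [0, 2**L).
    token = 0
    for b in bits:
        token = (token << 1) | b
    return u, u_hat, token


if __name__ == "__main__":
    random.seed(0)
    d, L = 16, 8  # illustrative sizes, not taken from the paper
    proj = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
    z = [random.gauss(0.0, 1.0) for _ in range(d)]

    u, u_hat, token = bsq_quantize(z, proj)
    err_sq = sum((a - b) ** 2 for a, b in zip(u, u_hat))
    print("token:", token)
    print("squared error:", err_sq, "bound:", 2 - 2 / math.sqrt(L))
```

Because every code vector is determined by its sign pattern, the codebook of size `2**L` is implicit and never has to be stored or searched, which is one way to read the parameter-efficiency claim in the summary.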