FiT: Flexible Vision Transformer for Diffusion Model

19 Feb 2024 | Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
This paper introduces the Flexible Vision Transformer (FiT), a transformer architecture designed for diffusion models that generate images at unrestricted resolutions and aspect ratios. Unlike traditional methods that treat images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens, enabling flexible training and inference that adapt to diverse aspect ratios. A carefully adjusted network structure and training-free extrapolation techniques allow FiT to generate images at any resolution without being constrained by predefined dimensions.

Comprehensive experiments show that FiT outperforms existing models across various resolutions, both within and beyond its training resolution distribution. The FiT-XL/2 model, trained for 1.8 million steps on the ImageNet-256 dataset, achieves state-of-the-art performance across multiple resolutions, including 160×320, 128×384, 224×448, and 160×480. With a training-free resolution extrapolation method, FiT-XL/2 improves further and surpasses the baseline DiT-XL/2 at all resolutions except 256×256.

FiT's contributions are threefold: a flexible training pipeline that eliminates the need for cropping, a transformer architecture tailored to dynamic token-length modeling, and a training-free resolution extrapolation method for arbitrary-resolution generation. The architecture adopts 2D Rotary Positional Embedding (RoPE) and the SwiGLU feed-forward block, both of which improve resolution generalization and overall performance. The results demonstrate state-of-the-art quality across a variety of resolution and aspect-ratio settings.
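To make the two architectural changes mentioned above more concrete, here is a minimal sketch (not the authors' code) of 2D RoPE applied to tokens indexed by their (row, column) position in a variable-size latent grid, and of a SwiGLU feed-forward block replacing a standard MLP. All tensor shapes, frequencies, and hyperparameters below are illustrative assumptions rather than FiT's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Standard 1D RoPE: rotate channel pairs of x by angles pos * inv_freq."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * inv_freq[None, :]     # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D RoPE: half of the channels encode the row index, the other half the column index."""
    d = x.shape[-1]
    return torch.cat([rope_1d(x[..., : d // 2], rows),
                      rope_1d(x[..., d // 2:], cols)], dim=-1)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: W2(SiLU(x W1) * x W3)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Usage: tokens from a hypothetical 20x30 latent grid (a non-square aspect ratio).
h, w, dim = 20, 30, 64
tokens = torch.randn(h * w, dim)
rows = torch.arange(h).repeat_interleave(w)   # row index of each token
cols = torch.arange(w).repeat(h)              # column index of each token
q = rope_2d(tokens, rows, cols)               # positions extend to any grid size
y = SwiGLU(dim, hidden=2 * dim)(tokens)
```

Because the rotary angles are computed from each token's grid coordinates rather than learned per position, the same weights can attend over grids of any height and width, which is what enables the training-free resolution extrapolation described in the paper.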