Denoising Vision Transformers


22 Jul 2024 | Jiawei Yang, Katie Z Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q Weinberger, Yonglong Tian, Yue Wang
Denoising Vision Transformers (DVT) addresses the grid-like artifacts in Vision Transformer (ViT) feature maps that degrade performance on dense prediction tasks. These artifacts, traced to the positional embeddings, are removed with a two-stage denoising pipeline. In the first stage, a cross-view feature consistency objective decomposes the raw ViT outputs into noise-free semantics, an artifact-related term, and a residual term, yielding artifact-free features for each image. In the second stage, a lightweight transformer block is trained on these denoised outputs to predict clean features directly from raw ViT features. The resulting denoiser attaches to a frozen, pre-trained ViT without retraining the backbone, applies to any ViT architecture, and is light enough for real-time use.

Evaluations across a range of pre-trained ViTs show consistent gains on semantic and geometric tasks, including semantic segmentation, depth estimation, object detection, and object discovery. DVT also improves models such as DINOv2 with registers (DINOv2-reg), outperforming previous methods. Beyond the quantitative gains, removing the artifacts makes the features more interpretable and semantically coherent, and ablations indicate that artifact removal is the primary source of the improvements. The approach is efficient, scalable, and generalizable; the findings highlight the role of positional embeddings in ViT design and encourage a re-evaluation of ViT architectures.
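To make the second stage more concrete, the sketch below shows one plausible form of the lightweight denoiser: a single transformer block that regresses clean features from raw ViT patch tokens while the pre-trained backbone stays frozen. The class name, dimensions, and the MSE objective against stage-1 targets are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    """Hypothetical stage-2 denoiser: one transformer block plus a linear head
    that maps raw (artifact-contaminated) ViT patch features to clean features."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.proj = nn.Linear(dim, dim)  # predicts the denoised feature per patch

    def forward(self, raw_feats: torch.Tensor) -> torch.Tensor:
        # raw_feats: (batch, num_patches, dim) taken from a frozen, pre-trained ViT
        return self.proj(self.block(raw_feats))

# Usage sketch: the "clean" targets are assumed to come from the stage-1
# cross-view decomposition described above; here they are placeholders.
denoiser = LightweightDenoiser(dim=768)
raw = torch.randn(2, 196, 768)            # e.g. 14x14 patch tokens from a ViT-B/16
clean_target = torch.randn(2, 196, 768)   # placeholder for stage-1 denoised features
loss = nn.functional.mse_loss(denoiser(raw), clean_target)
loss.backward()
```

Because only this small module is trained, it can be bolted onto any frozen ViT backbone at inference time, which is what allows the method to avoid retraining the original model.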