The paper "Denoising Vision Transformers" addresses noise artifacts in the feature maps of Vision Transformers (ViTs), which degrade performance on downstream dense prediction tasks such as semantic segmentation, depth estimation, and object detection. The authors identify positional embeddings as the primary source of these artifacts and propose a two-stage denoising approach called Denoising Vision Transformers (DVT). In the first stage, DVT fits neural fields to separate clean features from positional artifacts on a per-image basis. In the second stage, a lightweight transformer block is trained to predict clean features directly from raw ViT outputs, using the per-image cleaned features as supervision. DVT requires no retraining of pre-trained ViTs and can be integrated into existing models. The method is evaluated on several representative ViTs (DINO, DeiT-III, EVA-02, CLIP, DINOv2, DINOv2-reg) and shows significant improvements across dense prediction tasks. The authors also analyze the impact of positional embeddings and discuss the limitations and future directions of their work.
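The two-stage idea can be illustrated with a toy decomposition. The sketch below is not the paper's method: it replaces the per-image neural-field fit with a crude cross-image average (the positional artifact is shared across images, so averaging raw features at each patch position estimates it), and replaces the lightweight transformer with a linear least-squares map. All names, shapes, and the additive feature model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper):
# features for n images, each a grid of p patch tokens with d channels.
n, p, d = 64, 16, 8
clean = rng.normal(size=(n, p, d))      # image-dependent "semantic" features
artifact = rng.normal(size=(p, d))      # position-dependent artifact, shared across images
raw = clean + artifact                  # raw ViT outputs = clean + positional artifact

# Stage 1 (crude stand-in for the per-image neural-field fit):
# the artifact depends only on patch position, so averaging raw features
# over many images at each position estimates it (semantics average toward 0).
artifact_hat = raw.mean(axis=0)
denoised = raw - artifact_hat           # pseudo-clean features used as supervision

# Stage 2 (stand-in for the lightweight denoiser): fit a map W that predicts
# the denoised features from raw features, so unseen images can be denoised
# in a single feed-forward pass without any per-image optimization.
X = raw.reshape(-1, d)
Y = denoised.reshape(-1, d)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = X @ W
err = np.linalg.norm(pred - Y) / np.linalg.norm(Y)
print(f"relative fit error of the stage-2 denoiser: {err:.3f}")
```

The design point the sketch preserves is the supervision flow: stage 1 is expensive and per-image, but its outputs become training targets for a cheap stage-2 module that generalizes across images.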