15 Mar 2024 | Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
The paper "Diffusion Models Trained with Large Data Are Transferable Visual Models" by Guangkai Xu et al. explores the use of pre-trained diffusion models, specifically UNets or transformers, for various fundamental vision perception tasks. The authors demonstrate that these models, trained on large datasets like LAION-5B, can achieve remarkable transferable performance with minimal fine-tuning on target datasets, including monocular depth estimation, surface normal estimation, image segmentation, matting, and human pose estimation. Unlike previous works that reformulate these tasks to align with the diffusion process, the authors propose a simpler approach by fine-tuning the models with minimal adjustments, which is both simpler and faster. The paper highlights the robust generalization capabilities of the diffusion backbone and provides extensive quantitative and qualitative experiments to support these findings. The key contributions include the introduction of GenPercept, a paradigm that leverages pre-trained UNets for downstream image understanding tasks, and the demonstration of the model's effectiveness on diverse visual tasks with limited fine-tuning data.The paper "Diffusion Models Trained with Large Data Are Transferable Visual Models" by Guangkai Xu et al. explores the use of pre-trained diffusion models, specifically UNets or transformers, for various fundamental vision perception tasks. The authors demonstrate that these models, trained on large datasets like LAION-5B, can achieve remarkable transferable performance with minimal fine-tuning on target datasets, including monocular depth estimation, surface normal estimation, image segmentation, matting, and human pose estimation. Unlike previous works that reformulate these tasks to align with the diffusion process, the authors propose a simpler approach by fine-tuning the models with minimal adjustments, which is both simpler and faster. The paper highlights the robust generalization capabilities of the diffusion backbone and provides extensive quantitative and qualitative experiments to support these findings. The key contributions include the introduction of GenPercept, a paradigm that leverages pre-trained UNets for downstream image understanding tasks, and the demonstration of the model's effectiveness on diverse visual tasks with limited fine-tuning data.