This paper presents GenPercept, a simple yet effective approach for transferring pre-trained diffusion models to a range of vision perception tasks. By initializing image understanding models with a pre-trained UNet or transformer taken from a diffusion model, the authors show that only a moderate amount of target data (even purely synthetic data) is needed to achieve strong, transferable performance on fundamental perception tasks such as monocular depth estimation, surface normal estimation, image segmentation, matting, and human pose estimation. The diffusion backbone, trained on large-scale data, generalizes robustly across diverse tasks and real-world datasets.
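To make the initialization step concrete, below is a minimal sketch of loading a pre-trained diffusion UNet and VAE as the starting point for a perception model. It assumes the Hugging Face diffusers library and the stabilityai/stable-diffusion-2-1 checkpoint; the paper's exact checkpoint, freezing scheme, and training configuration may differ.

```python
# Sketch: initialize a perception backbone from a pre-trained diffusion model.
# Assumes the Hugging Face diffusers library and the Stable Diffusion 2.1
# checkpoint; the paper's exact checkpoint/config may differ.
import torch
from diffusers import UNet2DConditionModel, AutoencoderKL

model_id = "stabilityai/stable-diffusion-2-1"  # assumed checkpoint

# The denoising UNet becomes the trainable perception backbone.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
unet.requires_grad_(True)   # fine-tuned end-to-end on the target task

# The VAE maps between pixel space and the latent space the UNet operates in.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
vae.requires_grad_(False)   # kept frozen; used only to encode/decode
```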
The authors reformulate the input and output of the diffusion model, replacing the stochastic multi-step generation process with a deterministic one-step perception paradigm: the input image is encoded once, passed through the network in a single forward pass, and decoded into the target perception map, as sketched below. This reformulation enables efficient transfer learning without extensive fine-tuning, and evaluations across fundamental image perception tasks show that minimal architectural adjustments suffice for strong performance and generalization.
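A minimal sketch of the one-step paradigm follows, reusing diffusers-style modules from the previous snippet. The fixed timestep value, the text conditioning (text_embeds, e.g. an empty-prompt embedding), and treating the UNet output directly as the predicted latent are illustrative simplifications, not necessarily the paper's exact formulation.

```python
# Sketch of the deterministic one-step perception paradigm: the RGB latent is
# passed through the UNet exactly once, with no added noise and no sampling
# loop. Timestep choice and conditioning are simplifying assumptions here.
import torch

@torch.no_grad()
def one_step_perception(image, unet, vae, text_embeds, t=999):
    # image: (B, 3, H, W) in [-1, 1]
    latent = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor

    # Single deterministic forward pass -- no noise injection, no iteration.
    timestep = torch.full((image.shape[0],), t, device=image.device)
    pred = unet(latent, timestep, encoder_hidden_states=text_embeds).sample

    # Decode the predicted latent back to pixel space (e.g. a depth map).
    return vae.decode(pred / vae.config.scaling_factor).sample
```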
The paper also discusses the limitations of stochastic multi-step generation: its randomness conflicts with the deterministic nature of perception tasks, and the ensemble strategies used to suppress that variance are computationally expensive. A deterministic multi-step variant is introduced to address the randomness, but its predictions still retain residual RGB texture from the input image. The authors therefore propose using the pre-trained UNet directly as initialization for a one-step model, which both simplifies the pipeline and improves performance.
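The cost argument can be seen in pseudocode: a stochastic pipeline needs roughly n_seeds * n_steps UNet evaluations per input plus averaging, whereas the one-step model needs exactly one. The pipe and model callables below are hypothetical stand-ins for illustration, not APIs from the paper.

```python
# Illustrative contrast (not the paper's code): stochastic multi-step
# inference needs many denoising steps and a multi-seed ensemble to stabilize
# a deterministic target; the one-step reformulation removes both costs.
import torch

def stochastic_multistep(image, pipe, n_seeds=10, n_steps=50):
    # Each run starts from different Gaussian noise, so predictions vary and
    # must be averaged -- roughly n_seeds * n_steps UNet calls in total.
    preds = [pipe(image, num_inference_steps=n_steps,
                  generator=torch.Generator().manual_seed(s))
             for s in range(n_seeds)]
    return sum(preds) / n_seeds

def one_step(image, model):
    # Deterministic: one forward pass, no noise, no ensemble.
    return model(image)
```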
The experiments show that GenPercept achieves strong performance across geometric estimation, image segmentation, image matting, and human pose estimation, generalizing well to datasets and tasks unseen during fine-tuning. The paper concludes that pre-trained diffusion models offer a powerful and efficient foundation for vision perception, with room for further improvements and applications.