11 Jan 2024 | Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai
The paper introduces Deformable Convolution v4 (DCNv4), an efficient and effective operator designed for a broad range of vision applications. DCNv4 addresses two limitations of its predecessor, DCNv3: it removes softmax normalization in spatial aggregation to strengthen the operator's dynamic property and expressive power, and it optimizes memory access to minimize redundant operations. Together, these changes lead to significantly faster convergence and a substantial speedup, with DCNv4 achieving more than three times the forward speed of DCNv3. DCNv4 performs strongly across image classification, instance and semantic segmentation, and image generation. When integrated into generative models, such as the U-Net in latent diffusion models, DCNv4 outperforms the baseline, highlighting its potential to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model yields a speed increase of up to 80% without further modifications. The paper also explores DCNv4 in other modern backbone architectures, including ConvNeXt and ViT, showing its versatility and efficiency. The implementation of DCNv4 is available on GitHub to facilitate future research in the vision community.
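
To make the softmax change concrete, the following is a minimal sketch of the aggregation step only, not the DCNv4 implementation. It assumes a single output location, a single channel, and four already-sampled feature values; the names `sampled`, `raw`, and `aggregate` are hypothetical, and offset prediction, bilinear sampling, and grouping are omitted. With softmax, the aggregation weights lie in [0, 1] and sum to 1, so the output cannot leave the range of the sampled values. Without it, the weights are unbounded, as in ordinary convolution.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(values, weights):
    """Weighted sum of the K sampled values for one output location."""
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical features sampled at K = 4 offset locations (one channel).
sampled = [0.5, -1.0, 2.0, 0.25]
# Hypothetical raw aggregation weights predicted from the input.
raw = [1.5, -2.0, 0.5, 3.0]

# DCNv3-style (assumed simplification): softmax-normalized weights.
out_softmax = aggregate(sampled, softmax(raw))
# DCNv4-style (assumed simplification): raw, unbounded weights.
out_unbounded = aggregate(sampled, raw)

print(out_softmax, out_unbounded)
```

In this toy case the softmax output stays between the smallest and largest sampled value, while the unbounded output (4.5) falls outside that range, which illustrates the extra expressive freedom the summary refers to.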