Remote Sensing Image Change Detection with Transformers


11 Jul 2021 | Hao Chen, Zipeng Qi and Zhenwei Shi*
The paper introduces the Bitemporal Image Transformer (BIT), a novel method for change detection (CD) in high-resolution remote sensing images. The main challenge in CD is that complex objects and varying imaging conditions make it difficult to identify changes of interest while distinguishing them from irrelevant ones. Deep convolutional neural networks (CNNs) perform well on CD tasks but struggle with long-range context modeling in space-time, and non-local self-attention approaches are computationally inefficient and do not fully exploit temporal context.

To address these challenges, the authors propose BIT, which models context in the spatial-temporal domain with a transformer encoder. The intuition is that the high-level concepts of interest can be represented by a few visual words, i.e., semantic tokens. Each bitemporal image is transformed into a compact set of semantic tokens, and the transformer encoder models context in this token-based space-time. The resulting context-rich tokens are then projected back to pixel space to refine the original features. The model is integrated into a deep feature differencing-based CD framework, and extensive experiments on three datasets demonstrate its effectiveness and efficiency. Notably, the BIT-based model outperforms purely convolutional baselines while using significantly fewer parameters and less computation. The code for the proposed method is available at https://github.com/justchenhao/BIT_CD.
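The pipeline described above (tokenize each temporal image, run a transformer over the combined token set, project context back to pixels, then difference the refined features) can be sketched as follows. This is a minimal NumPy illustration under loose assumptions, not the authors' implementation: the real BIT uses a CNN backbone, learned weights, multi-head attention with positional embeddings, and a prediction head, whereas all matrices here are random placeholders and the attention is single-head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(feat, W_a):
    """Semantic tokenizer: spatial attention pools HW pixel features into L tokens."""
    attn = softmax(feat @ W_a, axis=0)   # (HW, L), softmax over pixel locations
    return attn.T @ feat                 # (L, C) attention-weighted feature sums

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention (transformer encoder core)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return w @ v

def project_back(feat, tokens, Wq, Wk, Wv):
    """Decoder step: pixel features query the context-rich tokens."""
    q, k, v = feat @ Wq, tokens @ Wk, tokens @ Wv
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return feat + w @ v                  # residual refinement of pixel features

rng = np.random.default_rng(0)
C, HW, L = 32, 64, 4                     # channels, pixels (8x8 map), tokens per image
feat1, feat2 = rng.normal(size=(2, HW, C))
W_a = rng.normal(size=(C, L))
Wq, Wk, Wv = rng.normal(size=(3, C, C)) * 0.1

# 1) tokenize both temporal images; concatenate into one bitemporal token set
tokens = np.concatenate([tokenize(feat1, W_a), tokenize(feat2, W_a)])  # (2L, C)
# 2) model spatial-temporal context among the tokens with a transformer encoder
tokens = tokens + self_attention(tokens, Wq, Wk, Wv)
# 3) project token context back onto each image's pixel features
ref1 = project_back(feat1, tokens, Wq, Wk, Wv)
ref2 = project_back(feat2, tokens, Wq, Wk, Wv)
# 4) feature differencing: the magnitude map would feed a change-prediction head
diff = np.abs(ref1 - ref2)               # (HW, C)
print(tokens.shape, diff.shape)
```

The key efficiency point the paper makes is visible in step 2: attention runs over only 2L = 8 tokens rather than all 2·HW = 128 pixel positions, which is why the transformer context modeling stays cheap compared with non-local self-attention over the full feature maps.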