🤖 AI Summary
To address the high computational complexity and insufficient fine-grained semantic modeling in change captioning for multi-temporal remote sensing imagery, this paper proposes an end-to-end, single-stage change captioning framework. We introduce a spatial-channel joint attention encoder and a difference-guided fusion module, integrated with cosine-similarity-driven lightweight feature fusion and a difference-aware Transformer decoder. This design enables efficient and precise modeling of semantic differences while keeping inference overhead low. Evaluated on the LEVIR-CC and DUBAI-CC benchmarks, our method achieves CIDEr scores of 140.23 and 97.74, respectively, substantially outperforming state-of-the-art approaches. The results demonstrate significant improvements in both computational efficiency and caption quality, validating the effectiveness of our architecture for accurate, scalable change description generation.
📝 Abstract
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands caused by multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in the transformer encoder and fusion module, SAT-Cap uses only a simple cosine-similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23 on the LEVIR-CC dataset and 97.74 on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be made available online.
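To make the single-stage fusion idea concrete, the sketch below shows one plausible form of a cosine-similarity-based difference fusion over bi-temporal feature maps: per-position cosine similarity across channels is turned into a dissimilarity weight that emphasizes changed regions of the feature difference. This is a minimal NumPy illustration under our own assumptions; the function name, shapes, and weighting scheme are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine_similarity_fusion(feat_t1, feat_t2, eps=1e-8):
    """Hypothetical single-stage fusion of bi-temporal features.

    feat_t1, feat_t2: feature maps of shape (C, H, W) from the two
    acquisition times. Returns a difference map of the same shape,
    weighted by per-position cosine dissimilarity across channels.
    """
    c, h, w = feat_t1.shape
    f1 = feat_t1.reshape(c, -1)  # (C, H*W)
    f2 = feat_t2.reshape(c, -1)

    # Per-position cosine similarity along the channel axis.
    num = (f1 * f2).sum(axis=0)
    denom = np.linalg.norm(f1, axis=0) * np.linalg.norm(f2, axis=0) + eps
    cos = num / denom  # (H*W,), in [-1, 1]

    # Dissimilar (i.e. changed) positions get larger weights.
    weight = (1.0 - cos).reshape(1, h, w)
    return weight * (feat_t2 - feat_t1)
```

Because the fusion is a closed-form similarity computation rather than a learned multi-stage module, it adds essentially no parameters, which is consistent with the low-overhead claim above.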