Change Captioning in Remote Sensing: Evolution to SAT-Cap -- A Single-Stage Transformer Approach

📅 2025-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational complexity and insufficient fine-grained semantic modeling in change captioning for multi-temporal remote sensing imagery, this paper proposes an end-to-end, single-stage change captioning framework. The authors introduce a spatial-channel joint attention encoder and a difference-guided fusion module, combining cosine-similarity-driven lightweight feature fusion with a difference-aware Transformer decoder. This design enables efficient and precise modeling of semantic differences while keeping inference overhead low. Evaluated on the LEVIR-CC and DUBAI-CC benchmarks, the method achieves CIDEr scores of 140.23 and 97.74, respectively, substantially outperforming prior state-of-the-art approaches. The results demonstrate gains in both computational efficiency and caption quality, validating the architecture for accurate, scalable change description generation.
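The cosine-similarity-driven fusion mentioned above can be illustrated with a minimal sketch. This is a generic interpretation, not the paper's exact module: the function name `cosine_difference_fusion` and the weighting scheme `(1 - similarity)` are assumptions for illustration, and the real SAT-Cap module may combine features differently.

```python
import numpy as np

def cosine_difference_fusion(x1, x2, eps=1e-8):
    """Fuse bi-temporal feature maps using a cosine-similarity weight.

    x1, x2: arrays of shape (N, C) -- N spatial positions, C channels,
    one array per acquisition date. Positions whose features diverge
    between the two dates (low cosine similarity) receive a larger
    weight in the fused difference feature.
    """
    num = (x1 * x2).sum(axis=1)
    denom = np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1) + eps
    sim = num / denom                 # cosine similarity per position, in [-1, 1]
    w = (1.0 - sim)[:, None]          # change weight: 0 if identical, up to 2 if opposite
    return w * (x2 - x1)              # signed difference, emphasized where changed
```

Because the weight collapses to zero where the two dates agree, unchanged regions contribute nothing to the fused feature, which is one plausible reason such a fusion can stay lightweight compared with multi-stage cross-attention.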

📝 Abstract
Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth's dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multi-stage fusion strategies, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To address these challenges, we propose SAT-Cap, a transformer-based model with single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion across the transformer encoder and a fusion module, SAT-Cap uses only a simple cosine-similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in the Spatial-Channel Attention Encoder, our approach significantly enhances the model's ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, which achieves CIDEr scores of 140.23 on the LEVIR-CC dataset and 97.74 on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
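The abstract's claim that jointly modeling spatial and channel information improves semantic extraction can be sketched with a generic spatial-channel attention gate. This is a minimal illustration of the general idea, not SAT-Cap's actual encoder: the pooling choices and sigmoid gating here are assumptions borrowed from common attention designs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_channel_attention(x):
    """Generic joint spatial-channel attention over one feature map.

    x: array of shape (H, W, C). A channel gate is derived from globally
    pooled statistics and a spatial gate from channel-pooled statistics;
    applying both multiplicatively emphasizes salient channels AND
    salient locations at the same time.
    """
    chan_gate = sigmoid(x.mean(axis=(0, 1)))            # (C,)   channel weights
    spat_gate = sigmoid(x.mean(axis=2, keepdims=True))  # (H, W, 1) spatial weights
    return x * chan_gate * spat_gate
```

Since both gates lie in (0, 1), the module can only attenuate, never amplify, activations; in a trained encoder the preceding layers compensate, so the net effect is a re-weighting toward object regions and informative channels.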
Problem

Research questions and friction points this paper is trying to address.

Remote Sensing
Change Detection
Computational Complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAT-Cap
Transformers
Remote Sensing Image Change Description