Remote Sensing Image Classification Using Deep Ensemble Learning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in remote sensing image classification where single models struggle to simultaneously capture local details and global contextual information, and naive stacking of CNNs and Vision Transformers (ViTs) often leads to feature redundancy and performance bottlenecks. To overcome this, the authors propose a multi-model ensemble architecture that integrates CNN and ViT through four complementary fusion models, followed by decision-level ensembling to mitigate redundancy and enhance discriminative capability. The proposed method achieves state-of-the-art accuracy rates of 98.10%, 94.46%, and 95.45% on the UC Merced, RSSCN7, and MSRSI datasets, respectively, significantly outperforming existing mainstream approaches while maintaining computational efficiency.

📝 Abstract
Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Convolutional Neural Networks (CNNs), the most widely used models for image classification, excel at local feature extraction but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs therefore leads to better performance than standalone architectures. However, simply adding more CNN and ViT components does not yield further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combines their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly its efficient use of computational resources during training.
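The paper does not include code, but the decision-level ensembling it describes, combining the outputs of the four fusion models at the final prediction stage, is commonly implemented as soft voting over the member models' class probabilities. A minimal sketch of that idea (function names, shapes, and values are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert per-class logits to probabilities (numerically stable)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def soft_vote(model_logits: list) -> np.ndarray:
    """Decision-level ensemble: average the softmax outputs of all
    member models, then take the argmax class for each sample."""
    probs = np.mean([softmax(l) for l in model_logits], axis=0)
    return probs.argmax(axis=-1)

# Four hypothetical fusion models, each emitting logits for a batch of
# 2 images over 3 classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 3)) for _ in range(4)]
predictions = soft_vote(logits)  # one class index per image
```

Averaging probabilities rather than hard votes lets a model that is confidently right outweigh members that are weakly wrong, which is one common way a decision-level ensemble suppresses redundant or noisy member outputs.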
Problem

Research questions and friction points this paper is trying to address.

Remote Sensing Image Classification
Convolutional Neural Networks
Vision Transformers
Feature Redundancy
Performance Bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Ensemble Learning
CNN-ViT Fusion
Remote Sensing Image Classification
Feature Redundancy Mitigation
Vision Transformers
Niful Islam
Graduate Student, Oakland University
Computer Vision, Machine Learning, NLP
Md. Rayhan Ahmed
Irving K. Barber Faculty of Science, University of British Columbia, Kelowna, Canada
Nur Mohammad Fahad
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
Salekul Islam
Professor, Electrical and Computer Engineering Department, North South University
A. K. M. Muzahidul Islam
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
Saddam Mukta
Post-doctoral Researcher, LUT University
#LUTsoftware, Artificial Intelligence, Machine Learning, NLP, Social Network Mining
Swakkhar Shatabda
Professor, School of Data and Sciences, BRAC University
optimization, machine learning, computational biology, bioinformatics