Remote Sensing Image Classification Using Deep Ensemble Learning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in remote sensing image classification where single models struggle to simultaneously capture local details and global contextual information, and naive stacking of CNNs and Vision Transformers (ViTs) often leads to feature redundancy and performance bottlenecks. To overcome this, the authors propose a multi-model ensemble architecture that integrates CNN and ViT through four complementary fusion models, followed by decision-level ensembling to mitigate redundancy and enhance discriminative capability. The proposed method achieves state-of-the-art accuracy rates of 98.10%, 94.46%, and 95.45% on the UC Merced, RSSCN7, and MSRSI datasets, respectively, significantly outperforming existing mainstream approaches while maintaining computational efficiency.

📝 Abstract
Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Convolutional Neural Networks (CNNs), the most widely used models for image classification, excel at local feature extraction but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs therefore leads to better performance than standalone architectures. However, simply adding more CNN and ViT components does not yield further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combines their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly its efficient use of computational resources during training.
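The paper does not include code, but the decision-level ensembling it describes, combining the outputs of the four fusion models at the final prediction stage, is commonly implemented as soft voting over the member models' class probabilities. A minimal sketch of that idea (function names, shapes, and values are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert per-class logits to probabilities (numerically stable)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def soft_vote(model_logits: list) -> np.ndarray:
    """Decision-level ensemble: average the softmax outputs of all
    member models, then take the argmax class for each sample."""
    probs = np.mean([softmax(l) for l in model_logits], axis=0)
    return probs.argmax(axis=-1)

# Four hypothetical fusion models, each emitting logits for a batch of
# 2 images over 3 classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(2, 3)) for _ in range(4)]
predictions = soft_vote(logits)  # one class index per image
```

Averaging probabilities rather than hard votes lets a model that is confidently right outweigh members that are weakly wrong, which is one common way a decision-level ensemble suppresses redundant or noisy member outputs.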
Problem

Research questions and friction points this paper is trying to address.

Remote Sensing Image Classification
Convolutional Neural Networks
Vision Transformers
Feature Redundancy
Performance Bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Ensemble Learning
CNN-ViT Fusion
Remote Sensing Image Classification
Feature Redundancy Mitigation
Vision Transformers
Niful Islam
Graduate Student, Oakland University
Computer Vision, Machine Learning, NLP
Md. Rayhan Ahmed
Irving K. Barber Faculty of Science, University of British Columbia, Kelowna, Canada
Nur Mohammad Fahad
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
Salekul Islam
Professor, Electrical and Computer Engineering Department, North South University
A. K. M. Muzahidul Islam
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
Saddam Mukta
Post-doctoral Researcher, LUT University
#LUTsoftware, Artificial Intelligence, Machine Learning, NLP, Social Network Mining
Swakkhar Shatabda
Professor, School of Data and Sciences, BRAC University
optimization, machine learning, computational biology, bioinformatics