AI Summary
This work addresses the challenge of achieving semantic-level understanding from heterogeneous remote sensing data, such as hyperspectral imaging (HSI) and LiDAR, by proposing MMLGNet, a novel framework that introduces language-guided contrastive learning for multimodal alignment in remote sensing. The method employs modality-specific encoders to extract visual features and integrates handcrafted textual embeddings to align vision and language representations in a shared latent space through bidirectional contrastive learning. Inspired by CLIP but utilizing only lightweight CNN encoders, MMLGNet demonstrates significant performance gains over existing purely visual multimodal approaches on two standard remote sensing benchmarks, thereby validating the effectiveness and potential of language supervision in enhancing semantic interpretation of remote sensing data.
Abstract
In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities such as Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bidirectional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Code is available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
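The bidirectional contrastive alignment described above follows the symmetric InfoNCE objective popularized by CLIP: matched visual-text pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied along both the vision-to-text and text-to-vision axes. The NumPy sketch below is illustrative only, not the authors' implementation; the function name, temperature value, and embedding shapes are assumptions.

```python
import numpy as np

def bidirectional_contrastive_loss(vis, txt, temperature=0.07):
    """Symmetric (CLIP-style) InfoNCE loss between visual and textual
    embeddings, both of shape [N, D]; row i of each is a matched pair.
    Illustrative sketch only -- not MMLGNet's actual implementation."""
    # L2-normalize so dot products become cosine similarities
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / temperature  # [N, N]; diagonal = matched pairs

    def nll_diag(l):
        # mean negative log-softmax of the diagonal (correct-pair) entries
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    loss_v2t = nll_diag(logits)      # vision -> text direction
    loss_t2v = nll_diag(logits.T)    # text -> vision direction
    return 0.5 * (loss_v2t + loss_t2v)
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; shuffling the text rows breaks the pairing and drives the loss up, which is the signal that pulls the two modalities together in the shared latent space.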