Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language foundation models struggle to leverage multispectral information effectively in remote sensing tasks and exhibit limited semantic expressiveness when constrained to RGB inputs, hindering zero-shot classification and retrieval performance. To address these limitations, this work proposes SATtxt, a framework that distills spectral priors from a multispectral teacher model into a lightweight RGB student model through a two-stage training process, while aligning the student’s visual representations with the embedding space of an instruction-augmented large language model. This approach enables efficient, fine-grained semantic understanding of remote sensing imagery using only RGB inputs. Experimental results demonstrate consistent improvements across EuroSAT, BigEarthNet, and ForestNet, with average gains of 4.2%, 5.9%, and 2.7% in zero-shot classification, retrieval, and linear probing, respectively.

πŸ“ Abstract
Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
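The two-stage recipe in the abstract can be sketched numerically. The following is a minimal illustration, not the paper's published implementation: the cosine-distance distillation objective, the single-matrix projector, the CLIP-style symmetric contrastive loss, and all dimensions and the temperature value are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    """Unit-normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

batch, d_student, d_shared = 4, 512, 768

# Stage 1: Spectral Representation Distillation (sketch).
# A frozen multi-spectral teacher embeds the scene; the RGB student's
# features pass through a lightweight projector and are pulled toward
# the teacher's embeddings. Random features stand in for real encoders.
teacher_feats = rng.standard_normal((batch, d_shared))   # frozen teacher output
student_feats = rng.standard_normal((batch, d_student))  # RGB student output
W_proj = rng.standard_normal((d_student, d_shared)) * 0.02  # lightweight projector

projected = l2_normalize(student_feats @ W_proj)
distill_loss = np.mean(1.0 - np.sum(projected * l2_normalize(teacher_feats), axis=-1))

# Stage 2: Spectrally Grounded Alignment (sketch).
# The distilled visual embeddings are aligned with LLM embeddings of
# instruction-augmented captions via a symmetric contrastive loss,
# matching image i to caption i along the diagonal of the logit matrix.
temperature = 0.07
text_embeds = l2_normalize(rng.standard_normal((batch, d_shared)))
logits = projected @ text_embeds.T / temperature
idx = np.arange(batch)
loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text
loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image
align_loss = 0.5 * (loss_i2t + loss_t2i)
```

At inference only the RGB student and projector are needed, which is what makes the spectral knowledge available without multi-spectral inputs.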
Problem

Research questions and friction points this paper is trying to address.

satellite imagery
vision-language foundation models
multi-spectral representation
semantic alignment
zero-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Representation Distillation
Instruction-Augmented LLMs
Vision-Language Foundation Model
RGB-only Inference
Spectrally Grounded Alignment
Minh Kha Do
La Trobe University, Melbourne, VIC 3086, Australia
Wei Xiang
Distinguished Professor, Cisco Research Chair of AI and IoT, La Trobe University
Internet of Things · Machine Learning · Wireless Sensor Networks · Wireless Communications · Computer
Kang Han
La Trobe University, Melbourne, VIC 3086, Australia
Di Wu
La Trobe University, Melbourne, VIC 3086, Australia
Khoa Phan
La Trobe University, Melbourne, VIC 3086, Australia
Yi-Ping Phoebe Chen
Professor of Computer Science, La Trobe University, Melbourne, Australia
Multimedia · Bioinformatics · Artificial Intelligence · Data Mining and Machine Learning · Pattern Recognition and Knowledge Discovery
Gaowen Liu
Cisco Research
Machine Learning · Computer Vision · Multimedia
Ramana Rao Kompella
Cisco Research, San Jose, CA, USA