Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing vision-language foundation models struggle to exploit multi-spectral information consistently in remote sensing tasks, and their CLIP-style text encoders limit semantic expressiveness and fine-grained alignment, which hinders zero-shot classification and retrieval. To address these limitations, this work proposes SATtxt, a framework trained in two stages: it first distills spectral priors from a frozen multi-spectral teacher model into an RGB student model via a lightweight projector, then aligns the student's visual representations with the embedding space of an instruction-augmented large language model. This approach enables efficient, fine-grained semantic understanding of remote sensing imagery using only RGB inputs at inference. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves over baselines by an average of 4.2% in zero-shot classification, 5.9% in retrieval, and 2.7% in linear probing.
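
To make the zero-shot setting concrete, the sketch below shows how classification typically works once image and text embeddings share a space: the image is assigned to the class whose text embedding is most similar. This is a generic illustration, not SATtxt's code. The function names, the class names, and the three-dimensional toy vectors are all hypothetical, and cosine similarity is assumed as the scoring rule.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_embedding, class_embeddings):
    """Return the best-matching class name and the score for every class."""
    scores = {
        name: cosine(image_embedding, emb)
        for name, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for text embeddings of class prompts (hypothetical values).
class_embeddings = {
    "forest": [0.9, 0.1, 0.0],
    "river": [0.1, 0.9, 0.1],
    "residential": [0.0, 0.2, 0.9],
}
# Toy stand-in for the RGB student's embedding of one image (hypothetical).
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_embeddings)
print(label)
```

In a real system the class embeddings would come from the text encoder and the image embedding from the RGB student; no classifier is trained on the target labels.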

📝 Abstract
Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
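
The first stage described above, distilling from a frozen teacher into a student through a projector, can be illustrated with a minimal sketch. The abstract does not state the loss or the projector architecture, so everything here is an assumption: a bias-free linear projector, a mean-squared-error objective, plain gradient descent, and toy feature vectors. For brevity only the projector is updated and the student feature is held fixed, whereas in practice the student encoder would be trained as well.

```python
def project(weights, feature):
    """Apply a bias-free linear projector (one weight row per output dim)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def mse(pred, target):
    """Mean squared error between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def distill_step(weights, student_feat, teacher_feat, lr=0.5):
    """One gradient-descent step on the projector; the teacher stays frozen.

    Returns the updated weights and the loss measured before the update.
    """
    pred = project(weights, student_feat)
    loss = mse(pred, teacher_feat)
    scale = 2.0 / len(teacher_feat)  # from d/dW of the mean squared error
    new_weights = [
        [w - lr * scale * (p - t) * x for w, x in zip(row, student_feat)]
        for row, p, t in zip(weights, pred, teacher_feat)
    ]
    return new_weights, loss


# Hypothetical features: a 3-d RGB student feature and a 4-d teacher feature.
student_feat = [0.5, -0.2, 0.1]
teacher_feat = [0.3, 0.7, -0.4, 0.1]

# Start the projector at zero and record the loss at every step.
weights = [[0.0] * len(student_feat) for _ in teacher_feat]
losses = []
for _ in range(50):
    weights, loss = distill_step(weights, student_feat, teacher_feat)
    losses.append(loss)

print(losses[0], losses[-1])
```

The recorded loss shrinks as the projected student feature approaches the teacher's, which is the sense in which spectral priors are transferred without the student ever seeing the extra bands.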
Problem

Research questions and friction points this paper is trying to address.

satellite imagery
vision-language foundation models
multi-spectral representation
semantic alignment
zero-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Representation Distillation
Instruction-Augmented LLMs
Vision-Language Foundation Model
RGB-only Inference
Spectrally Grounded Alignment