Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics

📅 2026-04-05

🤖 AI Summary

Existing prompt learning methods for vision-language models predominantly rely on first-order visual features, which struggle to handle domain shifts and local noise, thereby limiting generalization. This work proposes Gram-Anchored Prompt Learning (GAPL), the first approach to incorporate second-order statistical information, derived from Gram matrices, into prompt learning. GAPL constructs a global, structure-aware feature stream and dynamically fuses it with first-order spatial features to align language prompts with the underlying visual distribution. By anchoring textual prompts to global structural priors, the method jointly optimizes semantic alignment and structural consistency. Extensive experiments demonstrate that GAPL significantly enhances model robustness and generalization across multiple cross-domain benchmarks.

📝 Abstract

Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While this is effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL), a framework that synergizes local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream via Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to dynamically adapt to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features and show compelling performance of GAPL on various benchmarks.
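To make the term "second-order statistics" concrete, here is a minimal sketch of a Gram matrix computed over patch features. It is an illustration under stated assumptions, not the GAPL implementation: the function name `gram_matrix`, the normalization by the number of patches, and the toy inputs are all hypothetical, and the abstract does not say which layer the features come from or how the result is fused with the first-order stream. Plain Python is used to keep the example dependency-free.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of N patch feature vectors of length C.

    Entry (i, j) is the average over patches of channel_i * channel_j.
    Dividing by N is an assumed normalization, not taken from the paper.
    """
    n = len(features)
    c = len(features[0])
    return [
        [sum(f[i] * f[j] for f in features) / n for j in range(c)]
        for i in range(c)
    ]


# Toy input: 3 spatial patches, 2 channels each (hypothetical values).
patches = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]

# The same patches in a different spatial order.
shuffled = [patches[2], patches[0], patches[1]]

gram = gram_matrix(patches)
gram_shuffled = gram_matrix(shuffled)
```

In this sketch `gram` and `gram_shuffled` are identical: the Gram matrix sums over spatial positions, so it records how channels co-activate across the whole image and discards where each patch sits. That position-independence is the sense in which such a statistic is global rather than spatial.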

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

Prompt Learning

Second-Order Statistics

Domain Shift

Feature Robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gram Matrix

Second-Order Statistics

Prompt Learning

Vision-Language Models

Domain Robustness
