Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt learning methods for vision-language models predominantly rely on first-order visual features, which are sensitive to domain shifts and local noise and therefore limit generalization. This work proposes Gram-Anchored Prompt Learning (GAPL), which the authors describe as the first approach to incorporate second-order statistical information, derived from Gram matrices, into prompt learning. GAPL builds a global, structure-aware feature stream and dynamically fuses it with first-order spatial features to align language prompts with the underlying visual distribution. By anchoring textual prompts to global structural priors, the method jointly optimizes semantic alignment and structural consistency. The reported experiments indicate that GAPL improves robustness and generalization across multiple cross-domain benchmarks.
📝 Abstract
Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL), a framework that combines local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream based on Gram matrices that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to adapt dynamically to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features and show that GAPL performs strongly on various benchmarks.
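To make "second-order statistics via Gram matrices" concrete, here is a minimal sketch of the textbook Gram matrix over a set of patch features. It is not taken from the paper: the function name `gram_matrix`, the 1/N normalization, and the toy input are assumptions for illustration, and the listing does not say how GAPL computes or fuses this stream.

```python
from typing import List


def gram_matrix(features: List[List[float]]) -> List[List[float]]:
    """Channel-by-channel Gram matrix of N patch features with C channels.

    features: N rows (one per spatial position), each of length C.
    Returns a C x C matrix G with G[i][j] = (1/N) * sum_n f[n][i] * f[n][j].
    The 1/N normalization is an assumption, not a detail from the paper.
    """
    n = len(features)
    c = len(features[0])
    gram = [[0.0] * c for _ in range(c)]
    for row in features:
        for i in range(c):
            for j in range(c):
                gram[i][j] += row[i] * row[j]
    return [[value / n for value in gram_row] for gram_row in gram]


# Toy example: 3 spatial positions, 2 channels (hypothetical values).
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
g = gram_matrix(patches)

# Reordering the spatial positions leaves the Gram matrix unchanged,
# because it sums over positions and keeps only channel co-activations.
g_reordered = gram_matrix(list(reversed(patches)))
```

Because the matrix sums over spatial positions, it discards where a pattern occurs and keeps only how channels co-vary, which is one way to read the abstract's contrast between "spatially entangled" first-order features and global structural statistics.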
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Prompt Learning
Second-Order Statistics
Domain Shift
Feature Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gram matrix
second-order statistics
prompt learning
vision-language models
domain robustness
Minglei Chen
Southwestern University of Finance and Economics, Chengdu, Sichuan, China
Weilong Wang
PhD Student, Purdue University
Information System
Jiang Duan
Southwestern University of Finance and Economics, Chengdu, Sichuan, China
Ye Deng
Southwestern University of Finance and Economics
computer vision, machine learning