Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning

📅 2024-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing prompt learning methods for zero-shot learning (ZSL) fix visual prompts optimized solely on seen classes, limiting generalization to unseen classes. To address this, we propose a semantic-enhanced visual prompting framework that introduces modality-shared tokens—enabling concept-level cross-modal collaboration for the first time—and a visual residual refinement unit. At the prompt level, we impose semantic consistency supervision by jointly integrating attribute alignment constraints and a semantic-guided attention mechanism, thereby achieving semantics-driven enhancement of visual representations. Our method achieves significant improvements over state-of-the-art approaches on three standard ZSL benchmarks—CUB, SUN, and aWNI—with average accuracy gains of 3.2–5.7 percentage points. These results empirically validate that explicit semantic injection fundamentally strengthens cross-domain zero-shot transfer capability.

Technology Category

Application Category

📝 Abstract
Zero-shot learning (ZSL) endeavors to transfer knowledge from seen categories to recognize unseen categories, which mostly relies on the semantic-visual interactions between image and attribute tokens. Recently, prompt learning has emerged in ZSL and demonstrated significant potential as it allows the zero-shot transfer of diverse visual concepts to downstream tasks. However, current methods explore the fixed adaption of learnable prompt on seen domains, which makes them over-emphasize the primary visual features observed during training, limiting their generalization capabilities to unseen domains. In this work, we propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision. These are further integrated with primary visual features to attend to semantic-related information for visual enhancement, thus strengthening transferable ability. Experimental results on three benchmarks show that our AENet outperforms existing state-of-the-art ZSL methods. The code is provided in the zip file of supplementary materials.
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual prompts with semantic information for zero-shot learning
Addressing over-emphasis on primary visual features in current methods
Improving generalization to unseen domains through semantic-visual integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-enhanced visual prompt for enrichment
Concept-harmonized tokens across visual-attribute modalities
Visual residual refinement with attribute consistency supervision
M
Man Liu
Institute of Information Science, Beijing Jiaotong University; Beijing Key Laboratory of Advanced Information Science and Network Technology
H
H. Bai
Institute of Information Science, Beijing Jiaotong University; Beijing Key Laboratory of Advanced Information Science and Network Technology
F
Feng Li
Hefei University of Technology
Chunjie Zhang
Chunjie Zhang
Beijing Jiaotong University
multimediacomputer vision
Yunchao Wei
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer VisionMachine Learning
T
Tat-Seng Chua
National University of Singapore
Y
Yao Zhao
Institute of Information Science, Beijing Jiaotong University; Beijing Key Laboratory of Advanced Information Science and Network Technology