🤖 AI Summary
Open-world object counting suffers from poor generalization to unseen categories: existing fine-tuning methods model text–image consistency only for seen classes, limiting transferability. This paper proposes a semantic-driven two-stage visual prompt learning framework: (1) initializing prompts via category semantics, and (2) refining them under topological structure guidance; it further enables dynamic synthesis of prompts for unseen categories via semantic similarity. The method integrates vision-language model (VLM) prompt tuning, text encoder knowledge distillation, and a lightweight plug-and-play visual prompt module—introducing negligible additional parameters (<0.1%) and no measurable inference overhead. It achieves zero-shot and few-shot cross-category counting without architectural modification. Extensive experiments demonstrate state-of-the-art performance on FSC-147, CARPK, and PUCPR+, consistently outperforming mainstream open-world counting approaches.
📝 Abstract
Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to count arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories seen during training, which limits generalization to unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning (SDVPT) framework that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR): CSPI generates category-specific visual prompts, which TGPR then refines by distilling latent structural patterns from the VLM's text encoder. During inference, we dynamically synthesize visual prompts for unseen categories from their semantic correlation with training categories, enabling robust text-image alignment even for categories never observed in training. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
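The abstract's inference-time step, synthesizing a visual prompt for an unseen category from semantically related training categories, can be sketched as follows. This is a minimal NumPy illustration of one plausible reading, not the paper's actual implementation: the function name `synthesize_prompt`, the cosine-similarity measure, the `top_k` neighbor selection, and the softmax weighting are all assumptions.

```python
import numpy as np

def synthesize_prompt(unseen_text_emb, train_text_embs, train_prompts, top_k=3):
    """Blend learned category-specific visual prompts for an unseen category.

    unseen_text_emb: (d,) text embedding of the unseen category name.
    train_text_embs: (n, d) text embeddings of the n training categories.
    train_prompts:   (n, p) learned visual prompt vectors, one per category.
    Returns a (p,) synthesized visual prompt.
    (Hypothetical sketch; the paper's exact weighting scheme is not given here.)
    """
    # Cosine similarity between the unseen category and each training category.
    sims = train_text_embs @ unseen_text_emb / (
        np.linalg.norm(train_text_embs, axis=1)
        * np.linalg.norm(unseen_text_emb) + 1e-8
    )
    # Keep only the top-k most semantically related training categories.
    idx = np.argsort(sims)[-top_k:]
    # Softmax over their similarities gives the mixing weights.
    w = np.exp(sims[idx])
    w /= w.sum()
    # Weighted combination of the corresponding visual prompts.
    return (w[:, None] * train_prompts[idx]).sum(axis=0)
```

Under this reading, the synthesized prompt stays in the span of prompts already refined by TGPR, which is what lets the method handle unseen categories without retraining or architectural changes.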