🤖 AI Summary
Existing vehicle make and model recognition (VMMR) methods suffer from poor generalization to newly released models, while CLIP-based zero-shot approaches are constrained by fixed visual encoders and require costly image-level fine-tuning. To address these limitations, we propose a fine-tuning-free zero-shot VMMR paradigm: first, a vision-language model (VLM) parses fine-grained visual attributes (e.g., grille design, headlight configuration, silhouette) from input vehicle images; second, retrieval-augmented generation (RAG) retrieves candidate textual descriptions from a structured automotive knowledge base; finally, a large language model (LLM) performs semantic alignment and reasoning to identify the make and model. This pipeline eliminates end-to-end retraining and enables instantaneous incorporation of unseen models via textual specifications alone. On standard benchmarks, our method achieves a 19.7% absolute accuracy gain over CLIP-based baselines, significantly improving scalability and deployment efficiency, and offering a novel pathway for dynamic, real-time vehicle identification in smart-city applications.
📝 Abstract
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific fine-tuning. We propose a pipeline that integrates vision-language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by simply adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition accuracy by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
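The three-stage pipeline described above (describe → retrieve → reason) can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: the VLM and LM calls are stubbed out, all names and knowledge-base entries are hypothetical, and retrieval uses simple token overlap as a stand-in for a real embedding-based retriever.

```python
# Hypothetical sketch of the VLM + RAG + LM pipeline for zero-shot VMMR.
# Stage 1 and stage 3 (the VLM and LM calls) are stubbed for illustration.

# Stage 1 (stub): a VLM would convert a vehicle image into attribute text.
def describe_vehicle(image) -> str:
    # In practice: a vision-language model prompted to list fine-grained
    # attributes (grille design, headlight configuration, silhouette, ...).
    return ("kidney-shaped grille, angular LED headlights, "
            "coupe-like sedan silhouette")

# Stage 2: retrieve candidate entries from a textual knowledge base.
# New vehicle models can be added here as plain text, with no retraining.
KNOWLEDGE_BASE = {
    "BMW 4 Series": "kidney-shaped grille, angular LED headlights, coupe silhouette",
    "Toyota Camry": "wide trapezoidal grille, sleek headlights, sedan silhouette",
    "Ford F-150":   "large rectangular grille, C-clamp headlights, pickup body",
}

def tokenize(text: str) -> set:
    return set(text.lower().replace(",", " ").split())

def retrieve(description: str, k: int = 2):
    """Rank knowledge-base entries by token overlap with the description."""
    query = tokenize(description)
    scored = sorted(
        ((len(query & tokenize(entry)), model) for model, entry in KNOWLEDGE_BASE.items()),
        reverse=True,
    )
    return [model for _, model in scored[:k]]

# Stage 3 (stub): an LM would reason over the assembled prompt; here the
# top-ranked candidate is returned directly in its place.
def identify(image) -> str:
    description = describe_vehicle(image)
    candidates = retrieve(description)
    prompt = f"Observed attributes: {description}\nCandidates: {candidates}"
    # In practice: send `prompt` to a language model and parse its answer.
    return candidates[0]

print(identify(image=None))  # → BMW 4 Series
```

Because the knowledge base is plain text, supporting a newly released model amounts to inserting one more description string, which is the scalability property the abstract emphasizes.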