🤖 AI Summary
Existing vehicle make and model recognition (VMMR) methods suffer from poor generalization to newly released models, while CLIP-based zero-shot approaches are constrained by fixed visual encoders and require costly image-level fine-tuning. To address these limitations, we propose a fine-tuning-free zero-shot VMMR paradigm: first, a vision-language model (VLM) parses fine-grained visual attributes (e.g., grille design, headlight configuration, silhouette) from input vehicle images; second, retrieval-augmented generation (RAG) retrieves candidate textual descriptions from a structured automotive knowledge base; finally, a large language model (LLM) performs semantic alignment and reasoning to identify the make and model. This pipeline eliminates end-to-end retraining and enables instantaneous incorporation of unseen models via textual specifications alone. On standard benchmarks, our method achieves a 19.7% absolute accuracy gain over CLIP-based baselines, significantly improving scalability and deployment efficiency, and offering a novel pathway for dynamic, real-time vehicle identification in smart-city applications.
📝 Abstract
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific fine-tuning. We propose a pipeline that integrates vision-language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by simply adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition accuracy by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
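The three-stage pipeline described above (describe → retrieve → reason) can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: the VLM and LM calls are stubbed out, all names and knowledge-base entries are hypothetical, and retrieval uses simple token overlap as a stand-in for a real embedding-based retriever.

```python
# Hypothetical sketch of the VLM + RAG + LM pipeline for zero-shot VMMR.
# Stage 1 and stage 3 (the VLM and LM calls) are stubbed for illustration.

# Stage 1 (stub): a VLM would convert a vehicle image into attribute text.
def describe_vehicle(image) -> str:
    # In practice: a vision-language model prompted to list fine-grained
    # attributes (grille design, headlight configuration, silhouette, ...).
    return ("kidney-shaped grille, angular LED headlights, "
            "coupe-like sedan silhouette")

# Stage 2: retrieve candidate entries from a textual knowledge base.
# New vehicle models can be added here as plain text, with no retraining.
KNOWLEDGE_BASE = {
    "BMW 4 Series": "kidney-shaped grille, angular LED headlights, coupe silhouette",
    "Toyota Camry": "wide trapezoidal grille, sleek headlights, sedan silhouette",
    "Ford F-150":   "large rectangular grille, C-clamp headlights, pickup body",
}

def tokenize(text: str) -> set:
    return set(text.lower().replace(",", " ").split())

def retrieve(description: str, k: int = 2):
    """Rank knowledge-base entries by token overlap with the description."""
    query = tokenize(description)
    scored = sorted(
        ((len(query & tokenize(entry)), model) for model, entry in KNOWLEDGE_BASE.items()),
        reverse=True,
    )
    return [model for _, model in scored[:k]]

# Stage 3 (stub): an LM would reason over the assembled prompt; here the
# top-ranked candidate is returned directly in its place.
def identify(image) -> str:
    description = describe_vehicle(image)
    candidates = retrieve(description)
    prompt = f"Observed attributes: {description}\nCandidates: {candidates}"
    # In practice: send `prompt` to a language model and parse its answer.
    return candidates[0]

print(identify(image=None))  # → BMW 4 Series
```

Because the knowledge base is plain text, supporting a newly released model amounts to inserting one more description string, which is the scalability property the abstract emphasizes.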