WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient contrastive learning framework to address the high computational cost and poor scalability of existing generative approaches to open-domain visual entity recognition. The method leverages large language model embeddings to construct knowledge-rich entity representations and introduces a Vision-Guided Knowledge Adaptor for fine-grained image-text alignment. To sharpen discriminative capability, it incorporates a hard negative synthesis mechanism and a patch-level semantic alignment strategy. On the OVEN benchmark, the model achieves a 16% absolute accuracy gain on unseen categories while reducing inference latency by nearly two orders of magnitude relative to AutoVER, striking a favorable balance between performance and efficiency.
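The latency advantage described above comes from the contrastive retrieval setup: entity embeddings can be precomputed once, so recognizing an image reduces to a single matrix product plus an argmax rather than autoregressive decoding. A minimal sketch, with illustrative shapes and names (not WikiCLIP's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim = 10_000, 256  # illustrative sizes, not the paper's

# Precomputed, L2-normalized entity embeddings (in the paper's setting,
# these would be knowledge-rich representations derived from LLM encodings
# of Wikipedia entries).
entity_emb = rng.standard_normal((num_entities, dim)).astype(np.float32)
entity_emb /= np.linalg.norm(entity_emb, axis=1, keepdims=True)

# One image embedding from the vision encoder, also L2-normalized.
image_emb = rng.standard_normal(dim).astype(np.float32)
image_emb /= np.linalg.norm(image_emb)

scores = entity_emb @ image_emb  # cosine similarity to every entity
pred = int(np.argmax(scores))    # predicted entity index
```

Because the entity matrix is fixed at inference time, this lookup can also be served by any approximate nearest-neighbor index, which is what makes the approach scale to encyclopedic label spaces.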

📝 Abstract
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
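The training side of the contrastive paradigm described in the abstract can be sketched as an InfoNCE-style objective in which each image is contrasted with its positive entity, the other in-batch entities, and an extra bank of synthesized hard negatives. All names, shapes, and the temperature value below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def info_nce_with_hard_negatives(img, pos, hard_neg, temperature=0.07):
    """InfoNCE loss over a batch: image i's positive is entity i; the other
    in-batch entities plus a bank of synthesized hard negatives serve as
    negatives. Inputs are (B, D), (B, D), and (K, D) arrays."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    pos = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    hard_neg = hard_neg / np.linalg.norm(hard_neg, axis=1, keepdims=True)

    # Similarities to in-batch entities (B x B) and hard negatives (B x K).
    logits = np.concatenate([img @ pos.T, img @ hard_neg.T], axis=1) / temperature

    # Cross-entropy with the i-th image's positive at column i.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(img.shape[0])
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
B, K, D = 8, 16, 32  # batch size, hard-negative count, embedding dim
loss = info_nce_with_hard_negatives(
    rng.standard_normal((B, D)),
    rng.standard_normal((B, D)),
    rng.standard_normal((K, D)))
```

Appending visually similar but semantically distinct negatives to the logit matrix is what forces the kind of fine-grained discrimination the Hard Negative Synthesis Mechanism targets; the patch-level alignment of the VGKA would act on the encoders producing these embeddings.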
Problem

Research questions and friction points this paper is trying to address.

open-domain visual entity recognition
computational cost
scalability
practical deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

WikiCLIP
contrastive learning
Vision-Guided Knowledge Adaptor
Hard Negative Synthesis
open-domain visual entity recognition