KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

📅 2025-04-21
🤖 AI Summary
Existing multimodal entity linking (MEL) approaches commonly neglect the structural information encoded in knowledge graph (KG) triples, which limits their ability to resolve ambiguous mentions. KGMEL addresses this with a three-stage framework (generation, retrieval, reranking) that incorporates KG triples into MEL. A vision-language model (VLM) first generates triples for each mention from its text and images; contrastive learning then yields joint mention-entity representations over text, images, and triples, which are used to retrieve candidate entities; finally, the KG triples of the candidate entities are refined and a large language model (LLM) selects the best-matching entity. The authors report that KGMEL outperforms existing methods on benchmark datasets. The source code and datasets are publicly available.

📝 Abstract
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.
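
The three stages described in the abstract can be read as a simple data flow. The following is a minimal sketch of that flow only, not the authors' implementation: the `Mention` and `Entity` types, the `link_mention` function, the `top_k` parameter, the entity IDs, and all three toy stand-in functions are hypothetical names introduced for illustration. In KGMEL the stand-ins would correspond to a vision-language model, a learned retrieval model, and a large language model respectively.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


@dataclass
class Mention:
    text: str
    image_path: str
    triples: List[Triple] = field(default_factory=list)


@dataclass
class Entity:
    entity_id: str
    name: str
    triples: List[Triple] = field(default_factory=list)


def link_mention(
    mention: Mention,
    entities: List[Entity],
    generate_triples: Callable[[Mention], List[Triple]],
    score: Callable[[Mention, Entity], float],
    rerank: Callable[[Mention, List[Entity]], Entity],
    top_k: int = 2,
) -> Entity:
    # Stage 1 (generation): produce triples for the mention.
    mention.triples = generate_triples(mention)
    # Stage 2 (retrieval): keep the top_k highest-scoring entities.
    ranked = sorted(entities, key=lambda e: score(mention, e), reverse=True)
    candidates = ranked[:top_k]
    # Stage 3 (reranking): pick the best match among the candidates.
    return rerank(mention, candidates)


# Toy stand-ins (assumptions for illustration, not real models).
def toy_generate(mention: Mention) -> List[Triple]:
    return [("Apple", "instance of", "company")]


def toy_score(mention: Mention, entity: Entity) -> float:
    # Count shared (relation, tail) pairs between mention and entity.
    m = {(r, t) for _, r, t in mention.triples}
    e = {(r, t) for _, r, t in entity.triples}
    return float(len(m & e))


def toy_rerank(mention: Mention, candidates: List[Entity]) -> Entity:
    return candidates[0]


entities = [
    Entity("E2", "apple (fruit)", [("apple", "instance of", "fruit")]),
    Entity("E1", "Apple Inc.", [("Apple Inc.", "instance of", "company")]),
]
mention = Mention(text="Apple unveils a new phone", image_path="photo.jpg")
result = link_mention(mention, entities, toy_generate, toy_score, toy_rerank)
print(result.entity_id)
```

Passing the three stages in as callables keeps the sketch independent of any particular model; each stand-in can be swapped for a real component without changing the control flow.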

Problem

Research questions and friction points this paper is trying to address.

Most existing MEL methods overlook the structural information in KG triples
Text and images alone may not resolve ambiguous mentions
Text, images, and KG triples need to be integrated into joint mention-entity representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages KG triples for multimodal entity linking
Uses vision-language models to generate triples
Integrates text, images, and triples via contrastive learning
Refines candidate-entity triples and reranks candidates with large language models
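
To make the contrastive-learning item above concrete, here is a minimal sketch of an InfoNCE-style loss over fused mention and entity embeddings. It rests on assumptions not stated in the source: the fusion rule (mean of L2-normalized modality vectors), the `temperature` value, and the function names `fuse` and `info_nce` are all hypothetical, and the paper's actual objective and architecture may differ.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    # L2-normalize a vector (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse(text: Sequence[float], image: Sequence[float],
         triples: Sequence[float]) -> Vector:
    # Hypothetical fusion: sum the normalized modality vectors, renormalize.
    parts = [normalize(text), normalize(image), normalize(triples)]
    summed = [sum(vals) for vals in zip(*parts)]
    return normalize(summed)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(mentions: List[Vector], entities: List[Vector],
             temperature: float = 0.1) -> float:
    # mentions[i] is paired with entities[i]; the other entities in the
    # batch act as negatives for mention i.
    total = 0.0
    for i, m in enumerate(mentions):
        logits = [dot(m, e) / temperature for e in entities]
        # Numerically stable log-sum-exp for the denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denominator - logits[i]
    return total / len(mentions)


# Two toy mention/entity pairs with 2-dimensional modality vectors.
mention_vecs = [fuse([1, 0], [1, 0], [1, 0]), fuse([0, 1], [0, 1], [0, 1])]
entity_vecs = [fuse([1, 0], [1, 0], [1, 0]), fuse([0, 1], [0, 1], [0, 1])]

aligned = info_nce(mention_vecs, entity_vecs)
swapped = info_nce(mention_vecs, entity_vecs[::-1])
print(aligned < swapped)
```

The loss is small when each mention is closest to its own entity and large when the pairing is swapped, which is the behavior a retrieval stage relies on when ranking candidates by similarity.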