Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts

📅 2025-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal knowledge graph completion (MMKGC) methods suffer from low efficiency, difficulty in cross-modal information fusion, reliance on explicit entity embeddings, and excessive model complexity. Method: This paper proposes an entity-embedding-free sequence-to-sequence generation framework. Its core innovation is the first use of a pre-trained vision-language model (VLM) to generate link-aware cross-modal contextual representations, covering target entities and their neighborhoods, and to integrate these representations deeply into a lightweight Transformer encoder-decoder for end-to-end fine-tuning. This design avoids both the embedding bottleneck of traditional knowledge graph embedding (KGE) approaches and the high computational cost of full-parameter VLM fine-tuning. Contribution/Results: The method achieves state-of-the-art (SOTA) or near-SOTA performance across multiple large-scale MMKG benchmarks, reduces model parameters by over 60%, exhibits low hyperparameter sensitivity, and demonstrates strong generalization capability.

📝 Abstract
Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs) by leveraging information from various modalities alongside structural data. Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models, which often require creating an embedding for every entity. This results in large model sizes and inefficiencies in integrating multimodal information, particularly for real-world graphs. Meanwhile, Transformer-based models have demonstrated competitive performance in knowledge graph completion (KGC). However, their focus on single-modal knowledge limits their capacity to utilize cross-modal information. Recently, large vision-language models (VLMs) have shown potential in cross-modal tasks but are constrained by the high cost of training. In this work, we propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs, thereby extending their applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform relevant visual information from entities and their neighbors into textual sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the model with the generated cross-modal context. This simple yet effective method significantly reduces model size compared to traditional KGE approaches while achieving competitive performance across multiple large-scale datasets with minimal hyperparameter tuning.
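The sequence-to-sequence framing described in the abstract can be illustrated with a minimal sketch. The function below is hypothetical (the paper does not disclose its exact prompt format): it verbalizes a tail-prediction query (head, relation, ?) together with a VLM-generated caption of the head entity and its neighborhood triples into a single input sequence, and the target sequence the decoder is fine-tuned to generate is simply the tail entity's name.

```python
def build_kgc_sequence(head, relation, head_caption, neighbor_facts):
    """Verbalize a tail-prediction query as a seq2seq input string.

    head_caption: text produced by a pre-trained VLM from the entity's
    image (cross-modal context). neighbor_facts: (relation, entity)
    pairs from the head's graph neighborhood (link-aware context).
    The prompt template and all names here are illustrative assumptions,
    not the authors' actual format.
    """
    neighborhood = " | ".join(f"{r} {t}" for r, t in neighbor_facts)
    return (f"predict tail: {head} [{relation}] ? "
            f"visual context: {head_caption} "
            f"neighbors: {neighborhood}")


# Example query: (Eiffel Tower, located_in, ?) with a VLM caption.
src = build_kgc_sequence(
    head="Eiffel Tower",
    relation="located_in",
    head_caption="a wrought-iron lattice tower at dusk",
    neighbor_facts=[("designed_by", "Gustave Eiffel"),
                    ("height_m", "330")],
)
tgt = "Paris"  # training target: the decoder generates the tail entity
```

Because every entity is referred to by its textual name rather than a learned vector, no per-entity embedding table is needed, which is where the reported parameter savings come from.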
Problem

Research questions and friction points this paper is trying to address.

Multimodal Information
Transformer Models
Knowledge Graph Completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer
Multimodal Knowledge Graph Completion
Pre-trained Visual Language Model