OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address semantic alignment imbalance in vision-language cross-modal retrieval caused by inter-modal entropy disparity, this paper proposes the Open Semantic Hypergraph Adapter (OS-HGAdapter). Methodologically: (1) it leverages large language models' open-world semantic knowledge with lightweight prompt templates to enrich textual polysemy, mitigating semantic impoverishment in low-entropy text; (2) it introduces a hypergraph adapter to model multi-relational, multi-granular semantic associations between images and text, replacing conventional pairwise embedding constraints; and (3) it jointly optimizes cross-modal entropy consistency and structure-aware alignment losses. Without requiring domain-specific annotations or handcrafted rules, OS-HGAdapter achieves state-of-the-art performance on Flickr30K and MS-COCO, improving text-to-image retrieval R@1 by 16.8% and image-to-text R@1 by 40.1%, which significantly narrows the modality gap while suppressing open-world semantic noise.
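For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. The function name `recall_at_k`, the toy similarity scores, and the ground-truth sets are illustrative assumptions and are not taken from the paper.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose top-k retrieved candidates include a correct match.

    similarity[q][c] is the score between query q and candidate c;
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries against 3 images (hypothetical scores).
sim = [
    [0.9, 0.1, 0.2],  # query 0: image 0 ranked first (correct)
    [0.1, 0.3, 0.8],  # query 1: image 2 ranked first, correct image 1 ranked second
    [0.1, 0.4, 0.7],  # query 2: image 2 ranked first (correct)
]
truth = [{0}, {1}, {2}]
r1 = recall_at_k(sim, truth, k=1)
r2 = recall_at_k(sim, truth, k=2)
```

In this toy case two of three queries are answered correctly at rank 1, and all three within the top 2. An imbalance between text-to-image and image-to-text retrieval would show up as a gap between the two values obtained by scoring the similarity matrix in each direction.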

📝 Abstract

Text-image alignment constitutes a foundational challenge in multimedia content understanding, where effective modeling of cross-modal semantic correspondences critically enhances retrieval system performance through joint embedding space optimization. Given the inherent difference in information entropy between texts and images, conventional approaches often show an imbalance in the mutual retrieval of these two modalities. To address this challenge, we propose to use the open semantic knowledge of a Large Language Model (LLM) to fill the entropy gap and reproduce the alignment ability of humans in these tasks. Our entropy-enhancing alignment is achieved through a two-step process: 1) A new prompt template that does not rely on explicit knowledge of the task domain is designed so that the LLM enriches the polysemous descriptions of the text modality. By analogy, this increases the information entropy of the text modality relative to the visual modality; 2) A hypergraph adapter is used to construct multilateral connections between the text and image modalities, which corrects positive and negative matching errors for synonymous semantics in the same fixed embedding space, while reducing the noise caused by open semantic entropy by mapping the reduced dimensions back to the original dimensions. Comprehensive evaluations on the Flickr30K and MS-COCO benchmarks validate the superiority of our Open Semantic Hypergraph Adapter (OS-HGAdapter), showing 16.8% (text-to-image) and 40.1% (image-to-text) cross-modal retrieval gains over existing methods and establishing new state-of-the-art performance in semantic alignment tasks.
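The idea that enriching a caption raises the entropy of the text modality can be illustrated with a toy calculation. The abstract does not state how entropy is measured, so the sketch below assumes a simple proxy: the Shannon entropy of the unigram token distribution. The captions are made up, and the extra paraphrases merely stand in for LLM output.

```python
import math
from collections import Counter


def token_entropy(texts):
    """Shannon entropy (in bits) of the unigram token distribution over `texts`."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A single terse caption (hypothetical example).
original = ["a dog runs on grass"]

# The same caption plus paraphrases standing in for LLM-generated variants.
enriched = original + [
    "a puppy sprints across a lawn",
    "a hound dashes over the field",
]

h_before = token_entropy(original)
h_after = token_entropy(enriched)
```

Under this proxy, `h_after` exceeds `h_before` because the paraphrases spread probability mass over more distinct tokens. This is only an intuition aid; the paper's own notion of modality entropy may be defined differently.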

Problem

Research questions and friction points this paper is trying to address.

Addresses text-image alignment imbalance in cross-modal retrieval systems

Uses an LLM to enhance text entropy and bridge semantic gaps

Corrects matching errors through hypergraph adaptation in embedding space

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM enhances text polysemy to bridge entropy gap

Hypergraph adapter constructs multilateral cross-modal connections

Dimensionality reduction mapping reduces open semantic noise
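The hypergraph adapter and the reduce-then-restore mapping listed above can be sketched roughly as follows. This is not the paper's architecture: it assumes a generic mean-aggregation hypergraph propagation step placed inside a bottleneck adapter with a residual connection. The function names, the zero-initialised up-projection, and the toy incidence matrix are all illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hypergraph_propagate(x, incidence):
    """Mean-aggregate node features through hyperedges (Dv^-1 H De^-1 H^T X).

    incidence[v][e] is 1 if node v belongs to hyperedge e, else 0.
    Assumes every hyperedge has a member and every node has a hyperedge.
    """
    n_nodes, n_edges, dim = len(incidence), len(incidence[0]), len(x[0])
    # Step 1: each hyperedge feature is the mean of its member nodes.
    edge_feats = []
    for e in range(n_edges):
        members = [v for v in range(n_nodes) if incidence[v][e]]
        edge_feats.append([sum(x[v][j] for v in members) / len(members) for j in range(dim)])
    # Step 2: each node feature is the mean of its incident hyperedges.
    out = []
    for v in range(n_nodes):
        edges = [e for e in range(n_edges) if incidence[v][e]]
        out.append([sum(edge_feats[e][j] for e in edges) / len(edges) for j in range(dim)])
    return out


def hypergraph_adapter(x, incidence, w_down, w_up):
    """Bottleneck adapter: project down, propagate, ReLU, project up, add residual."""
    h = matmul(x, w_down)                    # (N, d) -> (N, r), reduced dimension
    h = hypergraph_propagate(h, incidence)   # mix features along hyperedges
    h = [[max(0.0, v) for v in row] for row in h]
    h = matmul(h, w_up)                      # (N, r) -> (N, d), back to original dimension
    return [[a + b for a, b in zip(rx, rh)] for rx, rh in zip(x, h)]


# Toy setup: 4 nodes (e.g. 2 text and 2 image embeddings), d=4, r=2, 2 hyperedges.
random.seed(0)
d, r = 4, 2
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
incidence = [[1, 0], [1, 1], [0, 1], [1, 1]]
w_down = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
w_up = [[0.0] * d for _ in range(r)]  # zero init (assumption): adapter starts as identity
out = hypergraph_adapter(x, incidence, w_down, w_up)
```

Unlike a pairwise edge, a hyperedge can connect any number of text and image nodes at once, which is one way to read "multilateral connections". With the up-projection initialised to zero, the adapter initially leaves the fixed embeddings unchanged and only departs from them as the projections are trained.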
