Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-grained image–text alignment faces challenges including inaccurate local region–word correspondence, high attention noise, and difficulty modeling one-to-many or many-to-one relationships. To address these, we propose a granularity-aware fine-grained alignment framework. First, we introduce modality-specific saliency modeling to independently assess the importance of visual regions and textual tokens. Second, we explicitly model regional uncertainty using a Gaussian mixture distribution, relaxing the conventional one-to-one matching assumption. Third, we integrate cross-modal contrastive learning for end-to-end optimization. Our method achieves state-of-the-art performance on Flickr30K and MS-COCO, is compatible with diverse backbone architectures, and significantly enhances the robustness and interpretability of alignment, particularly in complex, cluttered scenes.
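The first step, modality-specific saliency modeling, can be sketched as follows. This is a minimal numpy illustration under stated assumptions, not the paper's implementation: `saliency_pool` and the learned projection vector `proj` are hypothetical names, and the paper's actual scoring function may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_pool(features, proj):
    """Weight features by intra-modal saliency scores.

    features: (N, D) region or token embeddings from one modality
    proj:     (D,) learned saliency projection (hypothetical)
    Returns a (D,) saliency-weighted summary and the (N,) weights.
    """
    scores = features @ proj       # one scalar importance per region/token
    alpha = softmax(scores)        # normalize within the modality only,
                                   # with no cross-modal attention involved
    return alpha @ features, alpha

# toy usage with random embeddings
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))    # 5 regions/tokens, dim 4
proj = rng.normal(size=4)
pooled, alpha = saliency_pool(feats, proj)
```

Because the weights are computed within a single modality, salient features can be identified even when cross-modal attention would be noisy, which is the motivation the summary gives for this design.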

📝 Abstract
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, and is often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that combines significance- and granularity-aware modeling with region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
Problem

Research questions and friction points this paper is trying to address.

Addresses noisy attention mechanisms in fine-grained image-text alignment
Overcomes lack of robust intra-modal significance assessment mechanisms
Solves absence of fine-grained uncertainty modeling for region-word correspondences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Granularity-aware modeling for cross-modal alignment
Region uncertainty modeling with Gaussian distributions
Modality-specific biases for robust feature identification
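The region-uncertainty idea above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: each region is represented by a K-component Gaussian mixture, and region–word similarity is taken as the mixture-weighted expected cosine similarity over component means (covariances are omitted here for brevity; the paper's actual scoring may use them). All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_word_similarity(mix_weights, mix_means, word_emb):
    """Expected cosine similarity between a region's Gaussian
    mixture representation and a single word embedding.

    mix_weights: (K,) mixture weights, summing to 1
    mix_means:   (K, D) component means; multiple components let one
                 region match several words (one-to-many) and vice versa
    word_emb:    (D,) word embedding
    """
    means = mix_means / np.linalg.norm(mix_means, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb)
    # convex combination of per-component cosine similarities
    return float(mix_weights @ (means @ w))

# toy usage: a 3-component mixture in an 8-dim embedding space
rng = np.random.default_rng(0)
K, D = 3, 8
weights = softmax(rng.normal(size=K))
means = rng.normal(size=(K, D))
word = rng.normal(size=D)
sim = region_word_similarity(weights, means, word)
```

Because a region is a distribution rather than a point, the hard one-to-one matching assumption is relaxed: different mixture components can align with different words, which is the one-to-many behavior the summary describes.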
Jiale Liu
South China Normal University, Guangzhou
Haoming Zhou
South China Normal University, Guangzhou
Yishu Zhu
Harbin Institute of Technology, Shenzhen
Bingzhi Chen
Harbin Institute of Technology, Shenzhen
Yuncheng Jiang
West China Hospital, Sichuan University