FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical vision-language pretraining (VLP) methods face two key challenges: false negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these, we propose a refined alignment framework for medical image-text understanding: (1) semantic-aware positive sample mining to mitigate FaNe; (2) text-guided sparse attention pooling to strengthen local–global semantic alignment; and (3) adaptive reweighting of hard negative contrastive loss to enhance discriminative capability. Evaluated on MIMIC-CXR and RadGraph, our method achieves state-of-the-art performance across five downstream tasks—including image classification, object detection, and semantic segmentation—demonstrating substantial improvements in cross-modal representation quality and clinical interpretability.
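To make the first component concrete, here is a minimal sketch of semantic-aware positive pair mining: text-text cosine similarities are adaptively normalized per row and converted into soft positive targets that replace the usual one-hot identity in the image-text contrastive objective. The min-max normalization, temperature `tau`, and function name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mine_soft_positives(text_emb: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Convert text-text similarity into soft positive targets (illustrative sketch).

    text_emb: (B, D) report embeddings from the text encoder.
    Returns a (B, B) matrix of soft targets; semantically similar reports in the
    batch receive non-zero positive weight instead of being treated as negatives.
    """
    z = F.normalize(text_emb, dim=-1)          # unit-norm text embeddings
    sim = z @ z.t()                            # (B, B) cosine similarity
    # "Adaptive normalization" is read here as a per-row min-max rescaling, so
    # the sharpness of the targets does not depend on the batch's similarity scale.
    row_min = sim.min(dim=1, keepdim=True).values
    row_max = sim.max(dim=1, keepdim=True).values
    sim = (sim - row_min) / (row_max - row_min + 1e-6)
    return F.softmax(sim / tau, dim=1)         # rows sum to 1; the diagonal stays largest
```

These soft targets would then serve as the label distribution in a cross-entropy over the image-to-text similarity logits, so near-duplicate reports in the batch no longer act as false negatives.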

📝 Abstract
Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.
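As a rough illustration of the pooling module described in the abstract, the sketch below conditions attention over image patch tokens on the report embedding and keeps only the top-k highest-scoring patches. The class name, projection layout, and `top_k` value are assumptions made for the example, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedSparsePool(nn.Module):
    """Pool patch tokens into one local visual vector guided by the report embedding."""

    def __init__(self, dim: int, top_k: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # text embedding -> attention query
        self.k_proj = nn.Linear(dim, dim)   # patch tokens   -> attention keys
        self.v_proj = nn.Linear(dim, dim)   # patch tokens   -> attention values
        self.top_k = top_k
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) visual tokens; text_emb: (B, D) report embedding
        q = self.q_proj(text_emb).unsqueeze(1)           # (B, 1, D)
        k = self.k_proj(patch_tokens)                    # (B, N, D)
        v = self.v_proj(patch_tokens)                    # (B, N, D)
        scores = (q @ k.transpose(1, 2)) * self.scale    # (B, 1, N) relevance of each patch
        # Sparsify: keep only the top-k patch scores per image, mask out the rest.
        k_eff = min(self.top_k, scores.size(-1))
        thresh = scores.topk(k_eff, dim=-1).values[..., -1:]   # k-th largest score per row
        scores = scores.masked_fill(scores < thresh, float("-inf"))
        attn = F.softmax(scores, dim=-1)                 # sparse attention weights
        return (attn @ v).squeeze(1)                     # (B, D) text-conditioned local feature
```

The pooled vector can then be contrasted against sentence-level text features, supplying the localized alignment signal that a single global image embedding misses.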
Problem

Research questions and friction points this paper is trying to address.

False negatives arise when semantically similar reports are treated as negatives in cross-modal contrastive learning for medical VLP
Global image-report contrast provides insufficient fine-grained alignment between image regions and textual findings
Contrastive objectives that weight all negatives equally limit intra-modal discrimination of semantically similar samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware positive pair mining reduces false negatives
Text-conditioned sparse attention pooling enables fine-grained image-text alignment
Hard-negative aware contrastive loss strengthens intra-modal discrimination (see the sketch after this list)
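Below is a minimal sketch of that third component, assuming an InfoNCE-style formulation in which each in-batch negative is reweighted by its (detached) similarity to the anchor; `tau` and `beta` are hypothetical hyper-parameters, not values reported by the paper.

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          tau: float = 0.07, beta: float = 1.0) -> torch.Tensor:
    """Image-to-text InfoNCE loss with adaptively reweighted hard negatives (sketch)."""
    zi = F.normalize(img_emb, dim=-1)                  # (B, D) image embeddings
    zt = F.normalize(txt_emb, dim=-1)                  # (B, D) text embeddings
    logits = zi @ zt.t() / tau                         # (B, B) similarity logits
    B = logits.size(0)
    eye = torch.eye(B, device=logits.device, dtype=torch.bool)

    # Weight each negative by its similarity to the anchor (detached, so the
    # weights act as fixed coefficients); harder negatives get larger weights.
    # Rescaling by (B - 1) keeps the average negative weight at 1.
    neg_weights = torch.softmax(
        beta * logits.detach().masked_fill(eye, float("-inf")), dim=1) * (B - 1)

    exp_logits = logits.exp()
    pos = exp_logits.diagonal()                                    # matched pairs
    neg = (neg_weights * exp_logits.masked_fill(eye, 0.0)).sum(1)  # weighted negatives
    return -(pos / (pos + neg)).log().mean()
```

Harder negatives therefore contribute more to the denominator, pushing the encoders to separate clinically similar but unmatched image-report pairs.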
Authors

Peng Zhang
College of Computer Science and Software Engineering, Shenzhen University

Zhihui Lai
Shenzhen University

Wenting Chen
Department of Radiation Oncology, Stanford University

Xu Wu
College of Computer Science and Software Engineering, Shenzhen University

Heng Kong
Department of Thyroid and Breast Surgery, Affiliated Hospital Group of Guangdong Medical University, Shenzhen Baoan Central Hospital