🤖 AI Summary
Current medical vision-language pretraining (VLP) methods face two key challenges: false negatives (FaNe) induced by semantically similar texts, and insufficient fine-grained cross-modal alignment. To address these, we propose a refined alignment framework for medical image-text understanding: (1) semantic-aware positive sample mining to mitigate false negatives; (2) text-conditioned sparse attention pooling to strengthen local–global semantic alignment; and (3) a hard-negative aware contrastive loss with adaptive reweighting to enhance discriminative capability. Evaluated on MIMIC-CXR and RadGraph, our method achieves state-of-the-art performance across five downstream tasks, including image classification, object detection, and semantic segmentation, demonstrating substantial improvements in cross-modal representation quality and clinical interpretability.
📝 Abstract
Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.
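To make the three components concrete, here are short PyTorch sketches. First, the semantic-aware positive pair mining. This is a minimal illustration, not the paper's released code: it assumes mining operates on batch-level text embeddings, and the per-row min-max normalization with a threshold `tau` stands in for whatever adaptive normalization the paper actually uses.

```python
import torch
import torch.nn.functional as F

def mine_positive_pairs(text_emb: torch.Tensor, tau: float = 0.8) -> torch.Tensor:
    """Build a soft positive-pair matrix from text-text similarity.

    text_emb: (B, D) report embeddings from the text encoder.
    Returns a (B, B) row-normalized target matrix: nonzero entries mark
    reports semantically close enough to count as positives, so the
    contrastive loss no longer repels them as false negatives.
    """
    z = F.normalize(text_emb, dim=-1)
    sim = z @ z.t()                                   # (B, B) cosine similarity
    # Adaptive per-row min-max normalization (illustrative choice) so the
    # threshold is comparable across batches with different similarity ranges.
    lo = sim.min(dim=-1, keepdim=True).values
    hi = sim.max(dim=-1, keepdim=True).values
    sim_norm = (sim - lo) / (hi - lo + 1e-8)
    pos_mask = (sim_norm >= tau).float()
    pos_mask.fill_diagonal_(1.0)                      # each sample is its own positive
    # Row-normalize into a soft target distribution for a CE-style loss.
    return pos_mask / pos_mask.sum(dim=-1, keepdim=True)
```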
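Second, the text-conditioned sparse attention pooling module. Assumptions to flag: a single-query cross-attention layer where the report embedding attends over patch tokens, with top-k masking as the sparsification; the class name, `top_k`, and projection layout are all illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TextGuidedSparsePooling(nn.Module):
    """Pool patch tokens into one localized visual vector, guided by the report.

    The global text embedding acts as the query; attention over patch tokens
    is sparsified by keeping only the top-k scores, so the pooled vector
    focuses on report-relevant regions.
    """
    def __init__(self, dim: int, top_k: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.top_k = top_k
        self.scale = dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) visual tokens; text_emb: (B, D) report embedding.
        q = self.q_proj(text_emb).unsqueeze(1)            # (B, 1, D)
        k = self.k_proj(patch_tokens)                     # (B, N, D)
        v = self.v_proj(patch_tokens)
        scores = (q @ k.transpose(-2, -1)) * self.scale   # (B, 1, N)
        # Keep only the top-k patches per report; mask the rest before softmax.
        k_eff = min(self.top_k, scores.size(-1))
        thresh = scores.topk(k_eff, dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < thresh, float('-inf'))
        attn = scores.softmax(dim=-1)
        return (attn @ v).squeeze(1)                      # (B, D) localized visual feature
```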
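Third, the hard-negative aware contrastive loss. The sketch below up-weights negatives in proportion to their similarity to the anchor and consumes the soft positive matrix from the mining step; `beta`, the exponential weighting, and the normalization are assumptions in the spirit of the description, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(img_emb, txt_emb, pos_targets, temp=0.07, beta=1.0):
    """InfoNCE-style loss with adaptive hard-negative reweighting.

    img_emb, txt_emb: (B, D) embeddings; pos_targets: (B, B) soft positive
    matrix from mine_positive_pairs. Negatives that score high against the
    anchor get an exponential weight bump (beta controls sharpness), so the
    loss concentrates on hard, semantically similar negatives.
    """
    zi = F.normalize(img_emb, dim=-1)
    zt = F.normalize(txt_emb, dim=-1)
    logits = zi @ zt.t() / temp                        # (B, B) image-to-text logits
    neg_mask = (pos_targets == 0).float()
    # Adaptive weights: harder (more similar) negatives count more.
    # Row max is subtracted for numerical stability before the exp.
    stable = (logits - logits.max(dim=-1, keepdim=True).values).detach()
    w = torch.exp(beta * stable) * neg_mask
    w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    exp_logits = logits.exp()
    pos_term = (pos_targets * exp_logits).sum(dim=-1)
    # Reweighted negatives, rescaled to preserve the original negative count.
    neg_term = (w * exp_logits).sum(dim=-1) * neg_mask.sum(dim=-1)
    loss = -torch.log(pos_term / (pos_term + neg_term + 1e-8))
    return loss.mean()
```

In a full pipeline the mined positive matrix would feed this loss in both image-to-text and text-to-image directions, with the pooled local features from the sparse attention module supplying the fine-grained alignment term.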