🤖 AI Summary
In vision-language pretraining, many-to-many image-text matching introduces false negatives, which produce conflicting supervision signals, degrade the learned embedding space, and undermine hard negative mining. To address this, we propose an adaptive mini-batch construction strategy that dynamically balances hard negative and false negative sampling. Furthermore, we introduce the first proxy-signal-driven negative mining scheduler, where the proxy signal is derived from cross-modal alignment quality; this enables anchor-level, hardness-aware negative selection and eliminates reliance on fixed heuristic rules. Our method integrates seamlessly into ALBEF and BLIP-2 without architectural modification. Extensive experiments demonstrate consistent improvements across downstream tasks—including image/text retrieval and visual question answering—on multiple benchmarks, validating its robustness and generalization capability.
📝 Abstract
False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across two widely adopted VLP frameworks (ALBEF, BLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
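The idea of anchor-level, hardness-aware negative selection can be illustrated with a minimal sketch. This is not FALCON's actual algorithm (the paper's scheduler, proxy signal, and selection rule are not reproduced here); it only shows the general pattern the abstract describes: for each anchor, rank candidates by similarity, skip the most similar ones as suspected false negatives, and let a per-anchor hardness signal (here a hypothetical scalar in [0, 1] standing in for the scheduler's output) choose how hard the remaining negatives should be.

```python
import numpy as np

def select_negatives(anchor, candidates, hardness, k=2):
    """Illustrative hardness-aware negative selection for one anchor.

    `hardness` in [0, 1] is a hypothetical stand-in for a scheduler signal:
    1.0 favors the hardest surviving candidates, 0.0 the easiest. The top
    fraction of most-similar candidates is skipped as likely false negatives.
    Returns the indices of the k selected negatives.
    """
    # Cosine similarity of the anchor to every candidate embedding.
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8
    )
    order = np.argsort(-sims)                 # most similar first
    skip = max(1, int(0.1 * len(order)))      # drop top 10% as suspected false negatives
    pool = order[skip:]
    # Map the hardness signal to a window in the remaining, sorted pool.
    start = int((1.0 - hardness) * (len(pool) - k))
    return pool[start:start + k]

# Toy example: 2-D embeddings with an obvious similarity ordering.
anchor = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0],    # near-duplicate: likely false negative
                  [0.9, 0.1],    # hard negative
                  [0.5, 0.5],
                  [0.0, 1.0],
                  [-1.0, 0.0]])  # easy negative
hard = select_negatives(anchor, cands, hardness=1.0)  # -> indices [1, 2]
easy = select_negatives(anchor, cands, hardness=0.0)  # -> indices [3, 4]
```

In a real pipeline this per-anchor choice would run inside mini-batch construction, with the hardness signal updated over training by the scheduler rather than fixed by hand.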