🤖 AI Summary
In vision-language pretraining, many-to-many image-text matching introduces false negatives, which produce conflicting supervision signals, degrade the learned embedding space, and undermine hard negative mining. To address this, we propose an adaptive mini-batch construction strategy that dynamically balances hard negative and false negative sampling. Furthermore, we introduce the first proxy-signal-driven negative mining scheduler, where the proxy signal is derived from cross-modal alignment quality; this enables anchor-level, hardness-aware negative selection and eliminates reliance on fixed heuristic rules. Our method integrates seamlessly into ALBEF and BLIP-2 without architectural modification. Extensive experiments demonstrate consistent improvements across downstream tasks—including image/text retrieval and visual question answering—on multiple benchmarks, validating its robustness and generalization capability.
📝 Abstract
False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across two widely adopted VLP frameworks (ALBEF, BLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
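The idea of anchor-level, hardness-aware negative selection can be illustrated with a minimal sketch. This is not FALCON's actual algorithm (the paper's scheduler, proxy signal, and selection rule are not reproduced here); it only shows the general pattern the abstract describes: for each anchor, rank candidates by similarity, skip the most similar ones as suspected false negatives, and let a per-anchor hardness signal (here a hypothetical scalar in [0, 1] standing in for the scheduler's output) choose how hard the remaining negatives should be.

```python
import numpy as np

def select_negatives(anchor, candidates, hardness, k=2):
    """Illustrative hardness-aware negative selection for one anchor.

    `hardness` in [0, 1] is a hypothetical stand-in for a scheduler signal:
    1.0 favors the hardest surviving candidates, 0.0 the easiest. The top
    fraction of most-similar candidates is skipped as likely false negatives.
    Returns the indices of the k selected negatives.
    """
    # Cosine similarity of the anchor to every candidate embedding.
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8
    )
    order = np.argsort(-sims)                 # most similar first
    skip = max(1, int(0.1 * len(order)))      # drop top 10% as suspected false negatives
    pool = order[skip:]
    # Map the hardness signal to a window in the remaining, sorted pool.
    start = int((1.0 - hardness) * (len(pool) - k))
    return pool[start:start + k]

# Toy example: 2-D embeddings with an obvious similarity ordering.
anchor = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.0],    # near-duplicate: likely false negative
                  [0.9, 0.1],    # hard negative
                  [0.5, 0.5],
                  [0.0, 1.0],
                  [-1.0, 0.0]])  # easy negative
hard = select_negatives(anchor, cands, hardness=1.0)  # -> indices [1, 2]
easy = select_negatives(anchor, cands, hardness=0.0)  # -> indices [3, 4]
```

In a real pipeline this per-anchor choice would run inside mini-batch construction, with the hardness signal updated over training by the scheduler rather than fixed by hand.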