Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of model generalization caused by noisy cross-modal correspondences in large-scale web-collected data. The authors propose the IN2R framework, which abandons the conventional discrete label selection paradigm in favor of a continuous soft prototype synthesis mechanism grounded in intra-modal neighborhood consensus. Specifically, IN2R retrieves intra-modal neighbors from a dynamic cross-modal memory bank and applies a graph refiner to perform relational reasoning over them, generating soft supervision signals that are consistent with the local semantic neighborhood. The approach thus moves from selecting “proxy labels” to synthesizing the supervision target itself. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods in cross-modal retrieval under noisy correspondence.

📝 Abstract

Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a "Discrete Selection" paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Modal Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at https://github.com/liuyyy111/IN2R.
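
The idea of synthesizing a soft target from intra-modal neighbors, rather than picking one substitute label, can be illustrated with a small sketch. This is not the authors' implementation: the function name `soft_prototype`, the memory layout (a list of image/text embedding pairs), the parameters `k` and `temperature`, and above all the softmax-weighted average used in place of the learned Graph Refiner are assumptions made purely for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def soft_prototype(query_img, memory, k=2, temperature=0.1):
    """Synthesize a soft text-side target from intra-modal (image) neighbors.

    memory: list of (image_embedding, text_embedding) pairs.
    This layout is hypothetical, not taken from the paper.
    """
    # 1. Intra-modal retrieval: rank memory entries by image-to-image similarity.
    scored = sorted(
        ((cosine(query_img, img), txt) for img, txt in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # 2. Assumed stand-in for relational reasoning: softmax over similarities.
    #    The paper uses a learned Graph Refiner here; this is only a placeholder.
    max_sim = max(sim for sim, _ in scored)
    exps = [math.exp((sim - max_sim) / temperature) for sim, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]

    # 3. Continuous target: weighted average of the neighbors' paired text
    #    embeddings, renormalized to unit length.
    dim = len(scored[0][1])
    mixed = [
        sum(w * txt[i] for w, (_, txt) in zip(weights, scored))
        for i in range(dim)
    ]
    return normalize(mixed), weights


# Toy memory: two visually similar entries and one unrelated entry.
memory = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.9, 0.1], [0.8, 0.2]),
    ([0.0, 1.0], [0.0, 1.0]),
]
prototype, weights = soft_prototype([1.0, 0.05], memory, k=2)
```

Because the target is a blend of several neighbors' text embeddings, one mismatched neighbor only shifts it slightly, whereas a single discrete proxy would inherit that neighbor's error in full. That is the intuition behind the "Single-Point Fragility" argument, shown here only in toy form.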

Problem

Research questions and friction points this paper is trying to address.

noisy correspondence

cross-modal retrieval

inter-modal misalignment

web-harvested datasets

model generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal retrieval

noisy correspondence

intra-modal reasoning