TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval

📅 2025-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Label noise is pervasive in image-text cross-modal retrieval, and existing methods based on homogeneous co-teaching suffer from limited robustness and performance saturation. To address this, the paper proposes a tripartite collaborative learning framework comprising heterogeneous Coordinator, Master, and Assistant models. Its key contributions are: (1) a tripartite heterogeneous collaboration mechanism that overcomes the limitations of conventional homogeneous co-training; (2) a soft-label estimation method based on mutual information variation; and (3) a semantic variation consistency loss coupled with a noise-aware collaborative training strategy. Evaluated on three mainstream benchmarks, the approach achieves state-of-the-art retrieval accuracy under high-noise conditions while improving training stability.

📝 Abstract
Cross-modal retrieval matches data across different modalities by semantic relevance. Existing approaches implicitly assume that data pairs are well aligned and ignore widespread annotation noise, i.e., noisy correspondence (NC), which inevitably degrades performance. Some attempts employ the co-teaching paradigm with identical architectures to provide distinct data perspectives, but the differences between these models stem primarily from random initialization. The models therefore become increasingly homogeneous as training proceeds, and the additional information this paradigm brings is severely limited. To resolve this problem, we introduce Tripartite learning with Semantic Variation Consistency (TSVC) for robust image-text retrieval. We design a tripartite cooperative learning mechanism comprising a Coordinator, a Master, and an Assistant model. The Coordinator distributes data, and the Assistant model supports the Master model's noisy label prediction with diverse data. Moreover, we introduce a soft label estimation method based on mutual information variation, which quantifies the noise in new samples and assigns corresponding soft labels. We also present a new loss function to enhance robustness and optimize training effectiveness. Extensive experiments on three widely used datasets demonstrate that, even at increasing noise ratios, TSVC exhibits significant advantages in retrieval accuracy and maintains stable training performance.
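
The abstract does not give the formula behind the soft label estimation, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes mutual information is approximated by the InfoNCE lower bound on a small set of trusted anchor pairs, and that a new pair's soft label is a sigmoid of how much that bound changes when the pair is added. The function names, the `temperature` and `scale` parameters, and the sigmoid mapping are all hypothetical choices made for this example.

```python
import math


def info_nce_bound(sim, temperature=0.1):
    """InfoNCE lower bound on mutual information (assumed estimator).

    sim[i][j] is the similarity of image i and text j; the diagonal
    entries are the annotated (image, text) pairs.
    """
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += logits[i] - log_norm  # log-softmax of the annotated pair
    return math.log(n) + total / n


def soft_label(anchor_sim, cand_to_texts, images_to_cand, cand_self,
               temperature=0.1, scale=1.0):
    """Hypothetical soft label: sigmoid of the change in the MI bound
    when one candidate pair is appended to the anchor set."""
    # Extend every anchor row with its similarity to the candidate text.
    extended = [list(row) + [images_to_cand[i]]
                for i, row in enumerate(anchor_sim)]
    # Append the candidate image's row; its annotated text is the last column.
    extended.append(list(cand_to_texts) + [cand_self])
    variation = (info_nce_bound(extended, temperature)
                 - info_nce_bound(anchor_sim, temperature))
    return 1.0 / (1.0 + math.exp(-variation / scale))


# Three well-aligned anchor pairs (toy similarities, not real data).
anchors = [[0.9, 0.1, 0.1],
           [0.1, 0.9, 0.1],
           [0.1, 0.1, 0.9]]

# A candidate whose image matches its own text.
clean = soft_label(anchors, [0.1, 0.1, 0.1], [0.1, 0.1, 0.1], 0.9)
# A candidate whose image matches an anchor text instead of its own.
noisy = soft_label(anchors, [0.9, 0.1, 0.1], [0.1, 0.1, 0.1], 0.1)
```

In this toy setup the well-matched candidate raises the bound and receives a soft label above 0.5, while the mismatched candidate lowers it and receives a label below 0.5. Note that the `log(n)` term grows with the set size, so only the difference between the two estimates is meaningful here.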
Problem

Research questions and friction points this paper is trying to address.

Cross-modal retrieval
Annotation errors
Co-teaching effectiveness degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Triplet Semantic Learning
Inter-modal Retrieval
Error-resilient Annotation
Shuai Lyu
School of Computer Science, Beijing University of Posts and Telecommunications, China
Zijing Tian
School of Science, Beijing University of Posts and Telecommunications, China
Zhonghong Ou
School of Computer Science, Beijing University of Posts and Telecommunications (BUPT), China
Computer Vision, Deep Learning, Machine Learning, Big Data Analytics
Yifan Zhu
Beijing University of Posts and Telecommunications
PEFT of LLMs, Graph RAG, Graph mining
Xiao Zhang
School of Computer Science, Beijing University of Posts and Telecommunications, China
Qiankun Ha
School of Computer Science, Beijing University of Posts and Telecommunications, China
Haoran Luo
Nanyang Technological University
Knowledge Graph, Large Language Models, Graph Neural Networks
Meina Song
Professor of Computer Science, Beijing University of Posts and Telecommunications
data science