Semantic-Cohesive Knowledge Distillation for Deep Cross-modal Hashing

๐Ÿ“… 2025-10-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing deep supervised cross-modal hashing methods lack explicit interaction between multi-label semantic extraction and the original multimodal data, yielding semantic representations that are incompatible with the heterogeneous modalities and hindering effective modality alignment. Method: We propose SODA, a semantic-cohesive knowledge distillation framework that (i) reformulates multi-label annotations as a semantic prompt modality to establish image–label–text ternary interactions; (ii) employs a cross-modal teacher network to extract semantic priors and guide a student network to learn compact, semantically consistent hash codes directly in Hamming space; and (iii) integrates multi-label prompting, semantic consistency modeling, and knowledge distillation. Contribution/Results: SODA achieves significant improvements over state-of-the-art methods on two mainstream benchmarks. It effectively bridges the modality gap and enhances cross-modal retrieval accuracy by enforcing semantic coherence across heterogeneous modalities.

๐Ÿ“ Abstract
Recently, deep supervised cross-modal hashing methods have achieved compelling success by learning semantic information in a self-supervised way. However, they still suffer from a key limitation: the multi-label semantic extraction process fails to explicitly interact with the raw multimodal data, making the learned representation-level semantic information incompatible with the heterogeneous multimodal data and hindering the ability to bridge the modality gap. To address this limitation, in this paper, we propose a novel semantic cohesive knowledge distillation scheme for deep cross-modal hashing, dubbed SODA. Specifically, the multi-label information is introduced as a new textual modality and reformulated as a set of ground-truth label prompts, depicting the semantics present in the image in the same manner as the text modality. Then, a cross-modal teacher network is devised to effectively distill cross-modal semantic characteristics between the image and label modalities and thus learn a well-mapped Hamming space for the image modality. In a sense, such a Hamming space can be regarded as a kind of prior knowledge to guide the learning of the cross-modal student network and comprehensively preserve the semantic similarities between the image and text modalities. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over the state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Multi-label semantics lack interaction with raw multimodal data
Learned semantic representations are incompatible with heterogeneous data
Current methods struggle to effectively bridge the modality gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces multi-label information as a new textual modality
Devises cross-modal teacher network for semantic distillation
Learns Hamming space as prior knowledge for guidance
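The teacher–student idea listed above can be illustrated with a minimal sketch: a teacher's continuous hash codes define pairwise similarities in Hamming space, and the student is trained to reproduce them while also matching a label-derived similarity matrix. This is an assumption-laden toy in NumPy, not the paper's actual architecture or loss; the function names, the cosine-similarity formulation, and the `±1` similarity matrix construction are all illustrative choices.

```python
import numpy as np

def sign_hash(z):
    """Binarize continuous relaxed codes to {-1, +1} hash codes."""
    return np.where(z >= 0, 1.0, -1.0)

def hamming_distance(b1, b2):
    """Hamming distance between {-1,+1} codes of length K: (K - <b1, b2>) / 2."""
    k = b1.shape[-1]
    return (k - b1 @ b2.T) / 2

def distillation_loss(student_codes, teacher_codes, sim):
    """Illustrative similarity-preserving distillation objective.

    The student's continuous codes (e.g. tanh outputs before binarization)
    should reproduce (a) the pairwise cosine similarities the teacher
    learned and (b) the ground-truth label similarity matrix `sim`
    (+1 if two samples share a label, -1 otherwise).
    """
    s = student_codes / np.linalg.norm(student_codes, axis=1, keepdims=True)
    t = teacher_codes / np.linalg.norm(teacher_codes, axis=1, keepdims=True)
    s_sim, t_sim = s @ s.T, t @ t.T
    return float(np.mean((s_sim - t_sim) ** 2) + np.mean((s_sim - sim) ** 2))

# Toy example: 4 samples, 8-bit codes; the student starts near the teacher.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 8))
student = teacher + 0.1 * rng.standard_normal((4, 8))
sim = np.array([[ 1,  1, -1, -1],
                [ 1,  1, -1, -1],
                [-1, -1,  1,  1],
                [-1, -1,  1,  1]], dtype=float)

loss = distillation_loss(np.tanh(student), np.tanh(teacher), sim)
codes = sign_hash(teacher)  # final binary codes used for retrieval
```

At retrieval time, only `sign_hash` and `hamming_distance` are needed: queries and database items are compared by Hamming distance over the binary codes, which the distillation objective has aligned with label semantics.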
๐Ÿ”Ž Similar Papers
No similar papers found.
Changchang Sun
University of Illinois Chicago
Multimedia Retrieval · Computer Vision · Machine Learning
Vickie Chen
Rensselaer Polytechnic Institute
Yan Yan
University of Illinois Chicago