Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing self-supervised multimodal contrastive learning methods, which predominantly model cross-modal redundant information while neglecting modality-specific and interaction-induced synergistic components, often resulting in incomplete representations or information leakage. To overcome this, the authors propose COrAL, a novel framework that explicitly disentangles redundancy, specificity, and synergy within a unified architecture. COrAL employs a dual-path network, feature orthogonality constraints, and an asymmetric complementary masking mechanism to compel the model to infer cross-modal dependencies and learn structured representations. Extensive experiments demonstrate that COrAL achieves state-of-the-art or competitive performance on both synthetic benchmarks and multiple datasets from MultiBench, while exhibiting lower performance variance across runs—confirming the stability, reliability, and comprehensiveness of its learned representations.
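The feature orthogonality constraint mentioned above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function name, the cosine-based formulation, and the normalization choice are illustrative assumptions; the paper may define the constraint differently.

```python
import numpy as np

def orthogonality_penalty(shared: np.ndarray, specific: np.ndarray) -> float:
    """Mean squared cosine similarity between paired shared-path and
    modality-specific-path feature vectors (both of shape [batch, dim]).
    Driving this toward zero pushes the two paths to encode
    non-overlapping directions, i.e. disentangled information.
    Illustrative sketch only, not the paper's exact loss."""
    s = shared / np.linalg.norm(shared, axis=1, keepdims=True)
    u = specific / np.linalg.norm(specific, axis=1, keepdims=True)
    cos = np.sum(s * u, axis=1)          # per-sample cosine similarity
    return float(np.mean(cos ** 2))

# Toy check: identical features are maximally penalized, while features
# confined to orthogonal subspaces incur no penalty at all.
a = np.hstack([np.ones((8, 2)), np.zeros((8, 2))])   # lives in dims 0-1
b = np.hstack([np.zeros((8, 2)), np.ones((8, 2))])   # lives in dims 2-3
print(orthogonality_penalty(a, a))   # 1.0
print(orthogonality_penalty(a, b))   # 0.0
```

Adding such a penalty to the contrastive objective discourages the shared path from re-encoding modality-specific content, which is the leakage failure mode the summary describes.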

📝 Abstract
Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce **COrAL**, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.
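The "asymmetric masking with complementary view-specific patterns" from the abstract can be sketched as follows. This is a hypothetical reconstruction under the assumption that the two views receive exact-complement masks; the paper's actual masking strategy and ratios may differ.

```python
import numpy as np

def complementary_masks(num_tokens: int, mask_ratio: float,
                        rng: np.random.Generator):
    """Sample a random mask for one view and hand the other view the exact
    complement: every token hidden from view A is visible in view B and
    vice versa. A model reconstructing masked content therefore cannot
    fall back on within-view redundancy and must exploit the other
    modality, promoting synergy. Illustrative sketch, not the paper's
    exact mechanism."""
    n_masked = int(num_tokens * mask_ratio)
    order = rng.permutation(num_tokens)
    mask_a = np.zeros(num_tokens, dtype=bool)
    mask_a[order[:n_masked]] = True      # True = hidden in view A
    mask_b = ~mask_a                     # complement, hence asymmetric ratios
    return mask_a, mask_b

rng = np.random.default_rng(0)
m_a, m_b = complementary_masks(16, 0.75, rng)
print(m_a.sum(), m_b.sum())   # 12 4
```

Because the masks are complements, the two views see disjoint token sets whose union covers the input, so redundant cues alone cannot solve the reconstruction task.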
Problem

Research questions and friction points this paper is trying to address.

multimodal learning
contrastive learning
synergistic information
modality-specific information
information disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

orthogonalized representation
asymmetric masking
multimodal disentanglement
synergistic interaction
contrastive learning
Carolin Cissee
Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany
Raneen Younis
Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany; Lower Saxony Center for AI and Causal Methods in Medicine (CAIMed), Hannover, Germany
Zahra Ahmadi
Junior Group Leader, PLRI Medical Informatics Institute, Medical School of Hannover
Human-centered AI · Multimodal Learning · Data Mining · Machine Learning