🤖 AI Summary
This paper addresses the challenge of modeling both strong pairwise alignment and higher-order (e.g., XOR-type) inter-modal dependencies in multimodal joint representation learning. It proposes ConFu, a contrastive fusion framework that jointly optimizes unimodal and fused multimodal representations within a unified embedding space. ConFu adds a fused-modality contrastive loss to the standard pairwise objective; this term explicitly captures higher-order interactions and supports both one-to-one bidirectional and two-to-one cross-modal retrieval. The multimodal fusion encoders are trained together with the joint embedding space. On synthetic and real-world benchmarks, including MM-IMDB and Clotho, ConFu shows competitive performance on cross-modal retrieval and classification tasks. The evaluation also examines how the framework scales with increasing multimodal complexity.
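To make the XOR-type dependency concrete, the following minimal sketch (an illustration, not taken from the paper) treats two uniform binary variables x and y as stand-ins for two modalities and defines a third, z, as their XOR. The helper `mutual_information` is a hypothetical name introduced here. The example computes how much each single modality reveals about z, and how much the pair reveals.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information in bits for an equally weighted list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log2((count / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), count in joint.items()
    )


# All four equally likely settings of two binary "modalities" x and y,
# with a third modality z defined as their XOR (toy assumption).
samples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

mi_x_z = mutual_information([(x, z) for x, y, z in samples])
mi_y_z = mutual_information([(y, z) for x, y, z in samples])
mi_xy_z = mutual_information([((x, y), z) for x, y, z in samples])

print(mi_x_z, mi_y_z, mi_xy_z)  # prints: 0.0 0.0 1.0
```

Each modality alone carries zero bits about z, while the pair determines z completely. In this toy setting, an objective that only aligns x with z and y with z has no signal to learn from, which is the gap a fused-modality term is meant to close.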
📝 Abstract
Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term that aligns the fused embedding of a modality pair with the embedding of a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
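The objective described above can be sketched roughly as follows. This is a dependency-free illustration under stated assumptions, not the authors' implementation: it assumes a symmetric InfoNCE loss for every term, an unweighted sum of the three pairwise terms, and a hypothetical `fused_weight` on the fused term. The fused embeddings `z12` are taken as given, since the abstract does not specify the fusion encoder. All function and parameter names are invented for this example.

```python
from math import exp, log, sqrt


def normalize(v):
    """Scale a vector to unit length (assumes a nonzero vector)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + log(sum(exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def contrastive_fusion_loss(z1, z2, z3, z12, temperature=0.1, fused_weight=1.0):
    """Pairwise terms plus one fused term (the weighting is an assumption).

    z1, z2, z3: batches of unimodal embeddings, one list of floats per sample.
    z12: batch of fused embeddings for modalities 1 and 2, produced by some
         fusion encoder that is outside the scope of this sketch.
    """
    pairwise = (
        symmetric_info_nce(z1, z3, temperature)
        + symmetric_info_nce(z2, z3, temperature)
        + symmetric_info_nce(z1, z2, temperature)
    )
    fused = symmetric_info_nce(z12, z3, temperature)
    return pairwise + fused_weight * fused


# Toy batch of two samples in a 2-D embedding space.
anchors = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_fusion_loss(anchors, anchors, anchors, anchors)
misaligned_loss = contrastive_fusion_loss(anchors, anchors, anchors, swapped)
```

In the toy batch, swapping the fused embeddings so that they no longer match the third modality raises the loss, while the pairwise terms stay unchanged. Because the fused and unimodal embeddings share one space, the same similarity scores could in principle serve both one-to-one retrieval (rank `z3` against `z1`) and two-to-one retrieval (rank `z3` against `z12`).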