TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal models struggle to achieve effective three-way alignment among video, audio, and text modalities, primarily because they rely on pairwise similarity metrics (e.g., cosine similarity), neglecting the joint geometric structure across all three modalities—leading to suboptimal alignment and poor interpretability. To address this, we propose TRIANGLE: a neural geometric learning framework that directly models tri-modal alignment in high-dimensional embedding space via triangle-area-based similarity. Its core innovation is the first incorporation of geometric area as a principled similarity measure for multimodal alignment, eliminating conventional fusion layers and pairwise contrastive paradigms. TRIANGLE enables end-to-end, geometrically grounded, and inherently interpretable tri-modal alignment. Evaluated on video–text and audio–text retrieval, as well as audio–video classification, it achieves state-of-the-art performance, with up to a 9-percentage-point improvement in Recall@1.

📝 Abstract
Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may remain misaligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated into contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling while yielding interpretable alignment rationales. Extensive evaluation on three-modal tasks, such as video-text and audio-text retrieval and audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving on cosine-based methods by up to 9 points of Recall@1.
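The triangle-area idea from the abstract can be sketched concretely. The sketch below is a minimal illustration, not the paper's implementation: it assumes the three modality embeddings are the triangle's vertices, computes the area via the Gram determinant of two edge vectors (which generalizes the cross-product formula to any dimension), and uses negated area as a hypothetical similarity score, so that coincident embeddings (perfect alignment) score highest.

```python
import numpy as np

def triangle_area(u, v, w):
    """Area of the triangle whose vertices are embeddings u, v, w.

    Uses the Gram determinant of the two edge vectors, which works
    in any embedding dimension (unlike the 3-D cross product).
    """
    e1, e2 = v - u, w - u
    gram = np.array([[e1 @ e1, e1 @ e2],
                     [e2 @ e1, e2 @ e2]])
    # Clamp tiny negative determinants caused by floating-point error.
    return 0.5 * np.sqrt(max(np.linalg.det(gram), 0.0))

def tri_similarity(u, v, w):
    # Hypothetical mapping from area to similarity: aligned (near-
    # coincident) embeddings span a small triangle, so negate the area
    # to make smaller triangles score higher.
    return -triangle_area(u, v, w)

# Toy L2-normalized embeddings standing in for video, audio, and text.
rng = np.random.default_rng(0)
video, audio, text = (x / np.linalg.norm(x) for x in rng.normal(size=(3, 64)))
print(tri_similarity(video, audio, text))  # random triplet: nonzero area
print(tri_similarity(video, video, video))  # coincident points span zero area
```

A degenerate triangle (two or three coincident vertices) has zero area, which is why this measure, unlike pairwise cosine similarity, can only score a triplet highly when all three modalities agree jointly.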
Problem

Research questions and friction points this paper is trying to address.

Proposes TRIANGLE similarity for multimodal alignment beyond cosine similarity
Addresses ineffective modality alignment in current multimodal learning models
Improves three-modal alignment using triangle-area similarity in embedding space
Innovation

Methods, ideas, or system contributions that make the work stand out.

TRIANGLE computes similarity in high-dimensional embedding space
It aligns three modalities using triangle-area similarity measure
Replaces cosine similarity in contrastive losses for multimodal tasks
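As a rough illustration of the last point, here is a hypothetical InfoNCE-style sketch in which the logit for each candidate triplet is the negated triangle area instead of a cosine score. The batching and negative-sampling scheme (varying only the video index against paired audio-text anchors) are assumptions for the example, not the paper's recipe.

```python
import numpy as np

def triangle_area_batch(U, V, W):
    """Per-row triangle area for batched embeddings of shape (n, d)."""
    e1, e2 = V - U, W - U
    a = np.einsum('nd,nd->n', e1, e1)
    b = np.einsum('nd,nd->n', e1, e2)
    c = np.einsum('nd,nd->n', e2, e2)
    # Gram determinant a*c - b^2, clamped against floating-point error.
    return 0.5 * np.sqrt(np.clip(a * c - b * b, 0.0, None))

def tri_contrastive_loss(U, V, W, tau=0.07):
    """InfoNCE-style loss with triangle-area logits.

    logits[i, j] = -area(video_i, audio_j, text_j) / tau, so the
    matched triplet (i == j) should span the smallest triangle.
    """
    n = U.shape[0]
    logits = np.empty((n, n))
    for j in range(n):
        Vj = np.broadcast_to(V[j], U.shape)
        Wj = np.broadcast_to(W[j], U.shape)
        logits[:, j] = -triangle_area_batch(U, Vj, Wj) / tau
    # Numerically stable softmax cross-entropy with the diagonal
    # (matched triplets) as the positive class.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(n), np.arange(n)] + 1e-12).mean()
```

Swapping the similarity inside an existing contrastive objective, rather than adding fusion layers, is the drop-in usage pattern the summary describes.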