🤖 AI Summary
Existing multimodal relevance metrics are limited to bimodal settings and fail to capture higher-order joint correlations among three or more modalities, which compromises the accuracy and fairness of multimodal similarity modeling. To address this, we propose MAJORScore, the first unified relevance evaluation framework for N-modal data (N ≥ 3). Its core idea is to use a pretrained contrastive learning model to map heterogeneous modalities into a shared joint representation space, where all modalities can be compared on a single scale and high-order joint relevance can be quantified directly. Experiments show that, relative to existing methods, MAJORScore increases scores by 26.03%–64.29% when modalities are consistent and decreases them by 13.28%–20.54% when they are inconsistent, improving both reliability and discriminative power. As a scalable, standardized metric, MAJORScore supports robust relevance evaluation of large-scale multimodal datasets and models.
📝 Abstract
Multimodal relevance metrics are usually borrowed from the embedding ability of pretrained contrastive learning models for bimodal data (e.g., CLIP), which evaluate the correlation between pairs of cross-modal data. However, these commonly used metrics only support association analysis between two modalities, which greatly limits the evaluation of multimodal similarity. Herein, we propose MAJORScore, the first evaluation metric for the relevance of multiple modalities (N modalities, N ≥ 3) based on multimodal joint representation. By integrating multiple modalities into the same latent space, multimodal joint representation can represent different modalities accurately on one scale, providing the basis for fair relevance scoring. Extensive experiments show that, compared to existing methods, MAJORScore increases by 26.03%–64.29% for consistent modalities and decreases by 13.28%–20.54% for inconsistent ones. MAJORScore serves as a more reliable metric for evaluating similarity on large-scale multimodal datasets and for multimodal model performance evaluation.
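The abstract does not spell out how joint-space embeddings are aggregated into a single N-modal score, so the sketch below is only a minimal illustration of the idea, not the published MAJORScore formula. It assumes one embedding per modality, all produced by the same joint-representation model (an ImageBind-style encoder would fit, but the backbone choice is an assumption here), and scores relevance as the mean cosine similarity of each modality's embedding to the normalized centroid of all N embeddings. The function name `joint_relevance` and the centroid aggregation are hypothetical.

```python
import numpy as np


def joint_relevance(embeddings: list[np.ndarray]) -> float:
    """Illustrative N-modal relevance score (NOT the published formula).

    Each entry of `embeddings` is one modality's vector from a shared
    joint-representation space. The score is the mean cosine similarity
    of every modality to the normalized centroid of all N embeddings:
    high when the modalities describe the same content, lower otherwise.
    """
    # L2-normalize each modality embedding and stack into an (N, d) matrix.
    E = np.stack([e / np.linalg.norm(e) for e in embeddings])
    centroid = E.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Cosine of each row with the joint centroid, averaged over modalities.
    return float((E @ centroid).mean())


# Toy demo with synthetic vectors standing in for text/image/audio embeddings.
rng = np.random.default_rng(0)
d = 512
anchor = rng.normal(size=d)
consistent = [anchor + 0.1 * rng.normal(size=d) for _ in range(3)]   # same content
inconsistent = [rng.normal(size=d) for _ in range(3)]                # unrelated content
print(f"consistent:   {joint_relevance(consistent):.3f}")   # close to 1.0
print(f"inconsistent: {joint_relevance(inconsistent):.3f}") # markedly lower
```

Scoring against a joint centroid treats all N modalities symmetrically and stays linear in N, which is the kind of property a shared latent space is meant to enable; whether MAJORScore aggregates this way or otherwise is not stated in the abstract.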