Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient semantic alignment in multimodal out-of-distribution (OoD) detection, caused by the modality gap in pretrained vision-language models (e.g., CLIP), this paper proposes a cross-modal alignment regularization framework. Methodologically, it jointly fine-tunes the image and text encoders with an explicit cross-modal distance constraint on in-distribution image and text embeddings, enforcing compact alignment on the unit hypersphere; the authors show this regularization corresponds to maximum-likelihood estimation of an energy-based model on the hypersphere. The fine-tuned model is then combined with the existing NegLabel post-hoc scoring strategy, which exploits pretrained textual knowledge without requiring auxiliary OoD labels. Crucially, the alignment objective also mitigates forgetting of pretrained knowledge during fine-tuning. Evaluated on the ImageNet-1k OoD benchmark, the method achieves state-of-the-art OoD detection performance while simultaneously improving in-distribution (ID) classification accuracy, demonstrating synergistic enhancement of discriminability and generalization.
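The core regularizer described above can be sketched as a distance penalty between paired, L2-normalized image and text embeddings. This is an illustrative simplification, not the paper's exact objective; the function name and shapes are assumptions.

```python
import numpy as np

def alignment_loss(img_emb, txt_emb):
    """Illustrative cross-modal alignment penalty (assumed form):
    mean squared chordal distance between L2-normalized paired
    image/text embeddings on the unit hypersphere.
    For unit vectors, ||u - v||^2 = 2 - 2*cos(u, v), so minimizing
    this loss pulls semantically paired embeddings together."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))
```

Perfectly aligned pairs give a loss of 0; orthogonal pairs give 2, the chordal distance bound on the unit sphere.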

📝 Abstract
Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
Problem

Research questions and friction points this paper is trying to address.

Improving OoD detection via multi-modal fine-tuning
Addressing modality gap in ID embeddings for better performance
Enhancing cross-modal alignment to leverage pretrained knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal fine-tuning for OoD detection
Cross-modal alignment regularization technique
Hyperspherical representation space optimization