🤖 AI Summary
Addressing the dual challenges of insufficient in-distribution (ID) classification accuracy and poor out-of-distribution (OOD) detection generalization in multimodal intent understanding, this paper proposes a unified modeling paradigm. First, we design a weighted dynamic feature fusion network to enhance high-level cross-modal semantic alignment. Second, we introduce a novel pseudo-OOD generation strategy based on convex combinations of ID data, effectively mitigating the scarcity of authentic OOD samples. Third, we construct a multi-granularity contrastive representation learning framework that jointly optimizes coarse-grained ID/OOD discrimination and fine-grained intra-class/instance-level interactions. Evaluated on three mainstream multimodal intent datasets, our method achieves 3–10% improvements in AUROC for OOD detection while attaining new state-of-the-art ID classification accuracy. Furthermore, we release the first standardized multimodal OOD evaluation benchmark to foster reproducible and comparable research.
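The weighted dynamic fusion idea described above can be sketched in a few lines: each modality's features are projected into a shared space and scored with a per-sample importance logit, and the fused representation is a convex, softmax-weighted sum. This is an illustrative NumPy sketch, not the paper's actual architecture; the projection matrices, gating vectors, and modality names are all hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_fusion(feats, proj, gate):
    """Fuse per-modality features with dynamically learned weights.

    feats: dict name -> (batch, dim_m) features for each modality
    proj:  dict name -> (dim_m, d) projection into a shared space
    gate:  dict name -> (dim_m,) scoring vector yielding one logit
           per sample per modality (hypothetical gating scheme)
    """
    names = sorted(feats)
    projected = [feats[n] @ proj[n] for n in names]            # (batch, d) each
    logits = np.stack([feats[n] @ gate[n] for n in names], 1)  # (batch, M)
    weights = softmax(logits, axis=1)                          # convex weights per sample
    fused = sum(weights[:, [i]] * projected[i] for i in range(len(names)))
    return fused, weights
```

In a trained model the projections and gates would be learned end-to-end; here they simply show how modality importance can adapt per sample while the weights remain a convex combination.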
📝 Abstract
Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. First, they have limitations in capturing the nuanced, high-level semantics underlying complex in-distribution (ID) multimodal intents. Second, they generalize poorly when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations, dynamically learning the importance of each modality and adapting to multimodal contexts. To develop discriminative representations for both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and perform multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between the ID and OOD binary classes, while the fine-grained perspective not only enhances discrimination between different ID classes but also captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance, with a 3–10% increase in AUROC scores, while achieving new state-of-the-art results in ID classification. Data and code are available at https://github.com/thuiar/MIntOOD.
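The pseudo-OOD synthesis step can be illustrated with a mixup-style sketch: a convex combination of two ID samples drawn from different classes tends to fall between class manifolds and can serve as a surrogate OOD point. This is a minimal sketch under assumptions; the `Beta(alpha, alpha)` mixing distribution and the different-class pairing rule are illustrative choices, not necessarily the paper's exact sampling scheme.

```python
import numpy as np

def make_pseudo_ood(x, y, n_ood, alpha=1.0, rng=None):
    """Synthesize pseudo-OOD samples as convex combinations of ID pairs.

    x: (n, d) ID feature matrix; y: (n,) ID class labels.
    Pairs are drawn from *different* classes so the mixture lands between
    class regions; lambda ~ Beta(alpha, alpha) is an assumed mixing law.
    """
    rng = rng or np.random.default_rng()
    ood = []
    while len(ood) < n_ood:
        i, j = rng.integers(0, len(x), size=2)
        if y[i] == y[j]:
            continue  # same-class mixtures may remain in-distribution
        lam = rng.beta(alpha, alpha)
        ood.append(lam * x[i] + (1 - lam) * x[j])  # convex combination
    return np.stack(ood)
```

Because each synthetic point is a convex combination, it stays within the per-dimension bounds of the ID data, which makes these samples "near-OOD" and thus useful for the coarse-grained ID/OOD contrastive objective.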