Multimodal Classification and Out-of-distribution Detection for Multimodal Intent Understanding

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the dual challenges of insufficient in-distribution (ID) classification accuracy and poor out-of-distribution (OOD) detection generalization in multimodal intent understanding, this paper proposes a unified modeling paradigm. First, we design a weighted dynamic feature fusion network to enhance high-level cross-modal semantic alignment. Second, we introduce a novel pseudo-OOD generation strategy based on convex combinations of ID data, effectively mitigating the scarcity of authentic OOD samples. Third, we construct a multi-granularity contrastive representation learning framework that jointly optimizes coarse-grained ID/OOD discrimination and fine-grained intra-class/instance-level interactions. Evaluated on three mainstream multimodal intent datasets, our method achieves 3–10% improvements in AUROC for OOD detection while attaining new state-of-the-art ID classification accuracy. Furthermore, we release the first standardized multimodal OOD evaluation benchmark to foster reproducible and comparable research.

Technology Category

Application Category

📝 Abstract
Multimodal intent understanding is a significant research area that requires effective leveraging of multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing the nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations for both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective not only enhances the discrimination between different ID classes but also captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3~10% increase in AUROC scores while achieving new state-of-the-art results in ID classification. Data and codes are available at https://github.com/thuiar/MIntOOD.
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal intent classification accuracy
Enhances out-of-distribution detection performance
Addresses limitations in capturing high-level semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted feature fusion for multimodal representation
Pseudo-OOD data synthesis from ID convex combinations
Coarse and fine-grained multimodal representation learning
🔎 Similar Papers
No similar papers found.
H
Hanlei Zhang
State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Qianrui Zhou
Qianrui Zhou
Computer Science PhD candidate, Tsinghua University
Multimodal Intent UnderstandingComputer VisionNatural Language Processing
H
Hua Xu
State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
J
Jianhua Su
School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
R
Roberto Evans
State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
K
Kai Gao
School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China