T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses out-of-distribution (OOD) detection for vision-language models in open-world settings under both temporal distribution shift and covariate shift. The authors propose a two-stage, temporally robust OOD detection framework whose core contribution is extending Dual-Pattern Matching (DPM) to Temporal Quadruple-Pattern Matching (T-QPM). The approach combines joint image-text reasoning with lightweight learned fusion weights and adds regularization based on Average Thresholded Confidence (ATC) to keep performance stable as distributions evolve. Built on CLIP, the method is reported to significantly outperform static baselines on temporally partitioned benchmarks, giving temporally consistent multimodal OOD detection and improved domain generalization in non-stationary environments.

📝 Abstract
Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, so they fail under temporal drift; and (2) they lack robustness to covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and robustness to covariate distribution shift in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally consistent framework for multimodal OOD detection in non-stationary environments.
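
The abstract names two mechanisms without giving formulas: a learned fusion of semantic matching and visual typicality, and an ATC-based stability term. The sketch below is an illustration under stated assumptions, not the authors' implementation. All function names are hypothetical, the fusion is assumed to be a simple convex combination with one scalar weight, and ATC is shown in its generic form (choose a confidence threshold on labeled validation data so that the fraction of samples above it equals validation accuracy, then reuse that threshold on unlabeled data from a later time step). How T-QPM turns the ATC estimate into a regularizer is not specified in the abstract and is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fused_id_score(semantic_sims, visual_sims, weight, temperature=1.0):
    """Hypothetical fusion: convex combination of two in-distribution signals.

    semantic_sims: image-to-class-text similarities (assumed input).
    visual_sims:   image-to-class-prototype similarities (assumed input).
    weight:        scalar in [0, 1]; in a learned setting it would be trained.
    """
    semantic = max(softmax(semantic_sims, temperature))
    visual = max(softmax(visual_sims, temperature))
    return weight * semantic + (1.0 - weight) * visual


def atc_threshold(val_confidences, val_correct):
    """Threshold t such that the share of validation confidences strictly
    above t matches validation accuracy (exact when there are no ties)."""
    ranked = sorted(val_confidences, reverse=True)
    n_above = sum(1 for c in val_correct if c)  # samples that should exceed t
    if n_above == 0:
        return ranked[0]          # nothing is strictly above the maximum
    if n_above == len(ranked):
        return float("-inf")      # everything is above the threshold
    # Midpoint between the last confidence kept and the first one excluded.
    return (ranked[n_above - 1] + ranked[n_above]) / 2.0


def atc_estimate(confidences, threshold):
    """Estimated accuracy: fraction of confidences strictly above threshold."""
    return sum(1 for c in confidences if c > threshold) / len(confidences)


if __name__ == "__main__":
    # Toy numbers, invented purely for illustration.
    score = fused_id_score([2.0, 0.0], [0.0, 0.0], weight=0.5)
    print(f"fused ID score: {score:.4f}")

    t = atc_threshold([0.9, 0.8, 0.6, 0.4], [True, True, False, True])
    later_step = [0.95, 0.45, 0.3, 0.2]  # unlabeled confidences, later time step
    print(f"threshold: {t:.2f}, estimated accuracy: {atc_estimate(later_step, t):.2f}")
```

A drop in the ATC estimate from one time step to the next is the kind of signal a stability term could penalize. A single scalar weight keeps the fusion lightweight, which fits the abstract's description, but the actual parameterization may differ.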
Problem

Research questions and friction points this paper is trying to address.

out-of-distribution detection
temporal distribution shift
vision-language models
domain generalization
open-world learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Quadruple-Pattern Matching
Out-of-Distribution Detection
Vision-Language Models
Domain Generalization
Temporal Distribution Shift