Admitting Ignorance Helps the Video Question Answering Models to Answer

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current VideoQA models tend to learn shortcuts: they pick up spurious correlations between questions and answers, and answer incorrectly when the video and the text are poorly aligned. To address this, we propose an “admit ignorance” training paradigm in which questions are intervened on through displacement and perturbation, and the model is compelled to explicitly output “unknown” for an intervened question instead of guessing from superficial question-answer cues. We design a lightweight knowledge-agnostic identification module and an adapter framework that integrates with state-of-the-art video-text foundation models. The method supports both multi-choice and open-ended VideoQA tasks. Experiments show significant accuracy improvements with minimal architectural modifications, indicating improved robustness against spurious correlations and stronger generalization without sacrificing model expressiveness or inference efficiency.

📝 Abstract
Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses based solely on superficial question-answer correlations. We introduce methods for intervening in questions, using techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice and open-ended VideoQA settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.
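
The question interventions described above can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the abstract does not specify how displacement or perturbation is carried out, so the `Sample` type, the `UNKNOWN` label, the word-replacement strategy, and the helper names `displace`, `perturb`, and `add_unknown_option` are all hypothetical choices made for illustration. The sketch assumes that displacement means pairing a video with a question written for a different video, and that perturbation means corrupting some of the question's words; in both cases the training target becomes an explicit "unknown".

```python
import random
from dataclasses import dataclass
from typing import List

UNKNOWN = "unknown"  # hypothetical label the model should output for intervened questions


@dataclass
class Sample:
    video_id: str
    question: str
    answer: str


def displace(samples: List[Sample], rng: random.Random) -> List[Sample]:
    """Assumed form of displacement: give each video a question from a different video."""
    intervened = []
    for s in samples:
        others = [t for t in samples if t.video_id != s.video_id]
        if not others:
            continue  # no other video to borrow a question from
        donor = rng.choice(others)
        intervened.append(Sample(s.video_id, donor.question, UNKNOWN))
    return intervened


def perturb(sample: Sample, vocab: List[str], rng: random.Random,
            rate: float = 0.5) -> Sample:
    """Assumed form of perturbation: replace a fraction of question words with other words."""
    words = sample.question.split()
    if words:
        k = max(1, int(len(words) * rate))
        for i in rng.sample(range(len(words)), k):
            # Only pick replacements that differ from the original word.
            choices = [w for w in vocab if w != words[i]]
            if choices:
                words[i] = rng.choice(choices)
    return Sample(sample.video_id, " ".join(words), UNKNOWN)


def add_unknown_option(options: List[str]) -> List[str]:
    """For the multi-choice setting: append an explicit 'unknown' candidate answer."""
    return options + [UNKNOWN]
```

In a training loop, such intervened samples would be mixed with the original ones so that the model is rewarded for answering real questions and for declining intervened ones. One caveat of this simple displacement is that a borrowed question can occasionally still be answerable for the new video, so a practical version would likely need some filtering.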

Problem

Research questions and friction points this paper is trying to address.

VideoQA
Incorrect Association
Accuracy Improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty Awareness
Question Modification
Distraction Introduction