Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

📅 2026-06-08
🤖 AI Summary
Existing video question answering systems often rely on spurious statistical correlations and therefore struggle to localize evidence reliably and at a fine granularity in complex scenarios. This work proposes CREDiT, an approach grounded in structural causal models that applies causal intervention and counterfactual reasoning at the feature level. The method explicitly disentangles cross-modal representations into causal and non-causal components, using independence and minimality constraints to separate fine-grained causal evidence without requiring costly annotations. Evaluated on the NExT-GQA, SportsQA, and SPORTU-video benchmarks, the approach consistently improves both answer accuracy and reasoning reliability, supporting more trustworthy video question answering systems.
📝 Abstract
Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and they typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide only limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.
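The abstract does not give CREDiT's losses or architecture, so the following is only a toy sketch of the general idea it describes, not the authors' implementation. It assumes a soft per-dimension mask splits a feature vector into causal and non-causal parts, uses the mean mask value as a stand-in for a minimality constraint, uses squared cross-covariance as a rough proxy for an independence constraint (zero covariance is necessary but not sufficient for independence), and builds a counterfactual input by keeping one sample's causal part while swapping in another sample's non-causal part. All function names and the mask formulation are hypothetical.

```python
def split_features(features, mask):
    """Split a feature vector into causal and non-causal parts.

    `mask` holds values in [0, 1]; this masking scheme is an assumption
    made for illustration, not something stated in the abstract.
    """
    causal = [f * m for f, m in zip(features, mask)]
    non_causal = [f * (1.0 - m) for f, m in zip(features, mask)]
    return causal, non_causal


def minimality_penalty(mask):
    """Mean mask value: lower means fewer dimensions are kept as causal."""
    return sum(mask) / len(mask)


def independence_penalty(causal_batch, non_causal_batch):
    """Sum of squared cross-covariances between the two components.

    Computed over a batch (list of vectors). Hypothetical proxy for an
    independence constraint; it only penalizes linear dependence.
    """
    n = len(causal_batch)
    dim_c = len(causal_batch[0])
    dim_n = len(non_causal_batch[0])
    mean_c = [sum(row[i] for row in causal_batch) / n for i in range(dim_c)]
    mean_n = [sum(row[j] for row in non_causal_batch) / n for j in range(dim_n)]
    total = 0.0
    for i in range(dim_c):
        for j in range(dim_n):
            cov = sum(
                (causal_batch[k][i] - mean_c[i]) * (non_causal_batch[k][j] - mean_n[j])
                for k in range(n)
            ) / n
            total += cov * cov
    return total


def counterfactual_input(features_a, features_b, mask):
    """Keep sample A's causal part and swap in sample B's non-causal part."""
    causal_a, _ = split_features(features_a, mask)
    _, non_causal_b = split_features(features_b, mask)
    return [c + n for c, n in zip(causal_a, non_causal_b)]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [10.0, 20.0, 30.0, 40.0]
    mask = [1.0, 1.0, 0.0, 0.0]  # first two dimensions treated as causal
    print(counterfactual_input(a, b, mask))
    print(minimality_penalty(mask))
```

In a setup like this, a model whose answer changes when only the non-causal part is swapped would be relying on non-causal correlations, which is the kind of behavior the counterfactual inputs are meant to expose and suppress.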
Problem

Research questions and friction points this paper is trying to address.

Counterfactual Reasoning
Evidence Disentanglement
VideoQA
Causal Inference
Fine-Grained Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Reasoning
Evidence Disentanglement
Structural Causal Model
VideoQA
Causal Intervention