🤖 AI Summary
Existing video question answering (VideoQA) systems often rely on spurious statistical correlations and struggle to localize evidence reliably and at a fine-grained level in complex scenarios. This work proposes an approach grounded in structural causal models that introduces causal intervention and counterfactual reasoning at the feature level. The method explicitly disentangles cross-modal representations into causal and non-causal components, using independence and minimality constraints to separate fine-grained causal evidence without requiring costly annotations. Evaluated on the NExT-GQA, SportsQA, and SPORTU-video benchmarks, the approach consistently improves both answer accuracy and reasoning reliability, a step toward more trustworthy video question answering systems.
📝 Abstract
Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods rely on cross-modality correlations, on costly curated training resources, or on insufficient causal assumptions and constraints, and they typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide only limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.
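The abstract does not specify how the decomposition, the constraints, or the counterfactual inputs are implemented. The following Python sketch is only one plausible reading, not CREDiT's actual method. Every name in it is hypothetical: a soft mask splits each fused feature vector into a causal and a non-causal part, a squared cross-covariance term stands in for the independence constraint, mean mask activation stands in for the minimality constraint, and the feature-level intervention swaps non-causal parts between samples.

```python
import random

# Illustrative assumptions only: the abstract does not define these functions.

def split_features(z, mask):
    """Split each feature vector into causal and non-causal parts via a mask in [0, 1]."""
    causal = [[v * m for v, m in zip(row, mask)] for row in z]
    non_causal = [[v * (1.0 - m) for v, m in zip(row, mask)] for row in z]
    return causal, non_causal

def independence_penalty(a, b):
    """Sum of squared cross-covariances between the columns of a and b.

    Zero when the two parts are linearly uncorrelated (a proxy for independence).
    """
    n = len(a)
    mean_a = [sum(col) / n for col in zip(*a)]
    mean_b = [sum(col) / n for col in zip(*b)]
    total = 0.0
    for i in range(len(mean_a)):
        for j in range(len(mean_b)):
            cov = sum((a[k][i] - mean_a[i]) * (b[k][j] - mean_b[j])
                      for k in range(n)) / n
            total += cov ** 2
    return total

def minimality_penalty(mask):
    """Mean mask activation; lower values mark fewer dimensions as causal."""
    return sum(abs(m) for m in mask) / len(mask)

def counterfactual_features(causal, non_causal, rng):
    """Keep each sample's causal part and pair it with another sample's non-causal part."""
    order = list(range(len(causal)))
    rng.shuffle(order)
    return [[c + s for c, s in zip(causal[i], non_causal[j])]
            for i, j in enumerate(order)]

# Usage with random stand-in features (6 samples, 4 dimensions).
rng = random.Random(0)
z = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
mask = [1.0, 1.0, 0.0, 0.0]  # hypothetical hard mask: first two dims are causal
causal, non_causal = split_features(z, mask)
z_cf = counterfactual_features(causal, non_causal, rng)
loss_terms = (independence_penalty(causal, non_causal), minimality_penalty(mask))
```

In a setup like this, a model whose answer stays stable when only the non-causal part is swapped is relying on the causal part. How CREDiT actually measures or enforces this is not stated in the abstract.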