Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video question answering systems often rely on spurious statistical correlations and therefore struggle to localize evidence reliably and at fine granularity in complex scenarios. This work proposes an approach grounded in structural causal models that applies causal intervention and counterfactual reasoning at the feature level. The method explicitly disentangles cross-modal representations into causal and non-causal components, using independence and minimality constraints to separate fine-grained causal evidence without requiring costly annotations. Evaluated on the NExT-GQA, SportsQA, and SPORTU-video benchmarks, the approach consistently improves both answer accuracy and reasoning reliability, supporting more trustworthy video question answering systems.

📝 Abstract

Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and they typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and offer only limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.
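The abstract does not give CREDiT's equations, so the following is only a minimal sketch of the general idea it describes, not the authors' implementation. It assumes a soft element-wise mask splits a fused feature vector into causal and non-causal parts, uses squared cosine similarity as a stand-in independence penalty, uses the mean mask activation as a stand-in minimality penalty, and builds a counterfactual by swapping in the non-causal part of another sample. All function names and the specific penalty forms are hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def disentangle(features, mask_logits):
    """Split a feature vector into causal and non-causal parts.

    Assumption: a soft element-wise mask m in (0, 1) selects the causal
    part (m * f); the remainder ((1 - m) * f) is treated as non-causal.
    """
    mask = [sigmoid(logit) for logit in mask_logits]
    causal = [m * f for m, f in zip(mask, features)]
    non_causal = [(1.0 - m) * f for m, f in zip(mask, features)]
    return causal, non_causal, mask


def independence_penalty(causal, non_causal):
    """Squared cosine similarity between the two parts (0 when orthogonal).

    Assumption: a simple proxy for an independence constraint.
    """
    dot = sum(c * n for c, n in zip(causal, non_causal))
    norm_c = math.sqrt(sum(c * c for c in causal))
    norm_n = math.sqrt(sum(n * n for n in non_causal))
    if norm_c == 0.0 or norm_n == 0.0:
        return 0.0
    return (dot / (norm_c * norm_n)) ** 2


def minimality_penalty(mask):
    """Mean mask activation; lower values keep fewer dimensions as causal."""
    return sum(mask) / len(mask)


def counterfactual(causal, other_non_causal):
    """Feature-level intervention: keep the causal part of one sample and
    replace its non-causal part with that of another sample."""
    return [c + n for c, n in zip(causal, other_non_causal)]


# Toy usage: two 4-dimensional feature vectors sharing one mask.
logits = [20.0, 20.0, -20.0, -20.0]  # near-binary mask: first two dims causal
causal_a, non_causal_a, mask = disentangle([1.0, 2.0, 3.0, 4.0], logits)
_, non_causal_b, _ = disentangle([5.0, 6.0, 7.0, 8.0], logits)
cf_a = counterfactual(causal_a, non_causal_b)
penalty = independence_penalty(causal_a, non_causal_a) + minimality_penalty(mask)
```

In a trained model the mask logits would be learned and a prediction on `cf_a` would be encouraged to match the prediction on the original features, since only the non-causal part changed. That training loop is omitted here because the source does not specify it.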

Problem

Research questions and friction points this paper is trying to address.

Counterfactual Reasoning

Evidence Disentanglement

VideoQA

Causal Inference

Fine-Grained Localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Reasoning

Evidence Disentanglement

Structural Causal Model