🤖 AI Summary
Existing Text-VideoQA methods suffer from opaque decision-making and excessive reliance on scene text recognition (STR). To address this, we propose *Grounded Text-VideoQA*, a new task requiring models not only to answer questions but also to perform spatiotemporal localization of relevant scene text regions—thereby decoupling question answering from STR and enhancing interpretability and evaluation validity. Our contributions are threefold: (1) the first formal task definition; (2) a weakly supervised spatiotemporal disentangled contrastive learning framework integrating multimodal alignment and joint video–text–glyph triplet training; and (3) ViTXT-GQA, the first large-scale benchmark with fine-grained spatiotemporal text annotations. Experiments demonstrate significant improvements over strong baselines. Ablation and error analyses reveal that current STR capability remains the fundamental bottleneck, and human performance substantially exceeds even the best model—highlighting critical directions for future work.
📝 Abstract
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages the use of scene-text evidence over other shortcuts for answer prediction. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA and scene-text recognition, enabling diagnosis of the root cause of a failed prediction, e.g., wrong QA or wrong scene-text recognition. To achieve Grounded TextVideoQA, we propose the T2S-QA model, which features a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset, ViTXT-GQA, which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large performance gap with humans leaves ample space for improvement. Our further analysis with oracle scene-text inputs suggests that the major challenge is scene-text recognition. To advance research on Grounded TextVideoQA, our dataset and code are available at https://github.com/zhousheng97/ViTXT-GQA.git
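To make the disentangled temporal-to-spatial idea concrete, here is a minimal sketch of such a two-stage contrastive objective: the question representation is first contrasted against frame features to ground the temporal segment, then against scene-text region features within the grounded frame to localize the spatial evidence. All function names, shapes, and the InfoNCE formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(query, keys, pos_idx, tau=0.07):
    """InfoNCE: pull `query` toward keys[pos_idx], push it from the rest."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = k @ q / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_idx])

def t2s_contrastive(frame_feats, region_feats, pos_frame, pos_region, q_feat):
    """Temporal-to-spatial grounding loss (illustrative, not the paper's code):
    first contrast the question against frames (temporal grounding),
    then against scene-text regions inside the chosen frame (spatial)."""
    l_temporal = info_nce(q_feat, frame_feats, pos_frame)
    l_spatial = info_nce(q_feat, region_feats[pos_frame], pos_region)
    return l_temporal + l_spatial

# Toy shapes: 4 frames, 3 scene-text regions per frame, 8-d features.
rng = np.random.default_rng(0)
loss = t2s_contrastive(frame_feats=rng.normal(size=(4, 8)),
                       region_feats=rng.normal(size=(4, 3, 8)),
                       pos_frame=1, pos_region=2,
                       q_feat=rng.normal(size=8))
```

In a weakly-supervised setting the positive frame/region indices would not be annotated; they are typically chosen as the highest-scoring candidates under the model itself, with the QA loss providing the only direct supervision.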