Scene-Text Grounding for Text-Based Video Question Answering

📅 2024-09-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing Text-VideoQA methods suffer from opaque decision-making and excessive reliance on scene text recognition (STR). To address this, we propose *Grounded Text-VideoQA*, a new task requiring models not only to answer questions but also to perform spatiotemporal localization of relevant scene text regions—thereby decoupling question answering from STR and enhancing interpretability and evaluation validity. Our contributions are threefold: (1) the first formal task definition; (2) a weakly supervised spatiotemporal disentangled contrastive learning framework integrating multimodal alignment and joint video–text–glyph triplet training; and (3) ViTXT-GQA, the first large-scale benchmark with fine-grained spatiotemporal text annotations. Experiments demonstrate significant improvements over strong baselines. Ablation and error analyses reveal that current STR capability remains the fundamental bottleneck, and human performance substantially exceeds even the best model—highlighting critical directions for future work.

📝 Abstract
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages scene-text evidence over other shortcuts for answer prediction. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA and scene-text recognition. This enables diagnosis of the root cause of failed predictions, e.g., wrong QA or wrong scene-text recognition? To achieve Grounded TextVideoQA, we propose the T2S-QA model, which highlights a disentangled temporal-to-spatial contrastive learning strategy for weakly supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset, ViTXT-GQA, which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large performance gap with humans leaves ample space for improvement. Our further analysis with oracle scene-text inputs suggests that the major challenge is scene-text recognition. To advance research on Grounded TextVideoQA, our dataset and code are available at https://github.com/zhousheng97/ViTXT-GQA.git
Problem

Research questions and friction points this paper is trying to address.

Develops grounded TextVideoQA requiring spatio-temporal scene-text localization
Decouples question answering from scene-text recognition for interpretability
Addresses ineffective answer evaluation by using visual text regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-to-spatial contrastive learning strategy
Weakly-supervised scene-text grounding approach
Decouples QA from scene-text recognition
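The page does not include implementation details for the temporal-to-spatial strategy. As a rough illustration only, a minimal numpy sketch of a two-stage, InfoNCE-style contrastive objective is shown below: a temporal stage contrasts question-relevant frames against the rest, then a spatial stage contrasts text regions within the selected frame. All function names, the argmax pseudo-labeling (a common weak-supervision heuristic), and the embedding shapes are assumptions, not the paper's actual T2S-QA design.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def contrastive_nce(query, pos, negs, tau=0.07):
    """InfoNCE-style loss: pull query toward the positive embedding,
    push it away from the negatives (cosine similarity, temperature tau)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(query, pos)] + [cos(query, n) for n in negs]) / tau
    return -np.log(softmax(logits)[0])

def temporal_to_spatial_loss(q_emb, frame_embs, region_embs_per_frame, tau=0.07):
    """Hypothetical disentangled two-stage weakly supervised grounding:
    1) temporal: score frames against the question; the top-scoring frame
       is treated as the positive, the remaining frames as negatives;
    2) spatial: within the selected frame, the top-scoring scene-text
       region is the positive, the other regions are negatives.
    Returns the combined loss and the selected frame/region indices."""
    # Temporal stage: pseudo-label the best frame via argmax.
    t_scores = frame_embs @ q_emb
    t_pos = int(np.argmax(t_scores))
    t_negs = [frame_embs[i] for i in range(len(frame_embs)) if i != t_pos]
    loss_t = contrastive_nce(q_emb, frame_embs[t_pos], t_negs, tau)
    # Spatial stage: contrast regions inside the selected frame only.
    regions = region_embs_per_frame[t_pos]
    s_scores = regions @ q_emb
    s_pos = int(np.argmax(s_scores))
    s_negs = [regions[i] for i in range(len(regions)) if i != s_pos]
    loss_s = contrastive_nce(q_emb, regions[s_pos], s_negs, tau)
    return loss_t + loss_s, t_pos, s_pos
```

The disentanglement here is that spatial negatives are drawn only from the frame selected by the temporal stage, so the two grounding decisions are supervised separately rather than jointly over all frame-region pairs.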
Sheng Zhou
Hefei University of Technology, Hefei, 230601, China
Junbin Xiao
National University of Singapore
Video and Language, Embodied Interaction, Trustworthy Multimodality
Xun Yang
University of Science and Technology of China, Hefei, 230026, China
Peipei Song
University of Science and Technology of China
Multimedia, Computer Vision, Machine Learning
Dan Guo
IEEE Senior Member, Professor, Hefei University of Technology
Multimedia Computing, Artificial Intelligence
Angela Yao
National University of Singapore
computer vision, deep learning, machine learning
Meng Wang
Hefei University of Technology, Hefei, 230601, China
Tat-Seng Chua
National University of Singapore, 117418, Singapore