TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video-language models perform near chance on formal temporal-logic question answering because they cannot precisely localize actions in time. This work proposes a three-tier system. First, where source annotations are available, questions are parsed into executable temporal-logic programs and evaluated deterministically against action timelines reconstructed from those annotations. Second, where no annotation exists, the system falls back to a strong open-source video-language model. Third, the question categories on which that model is weakest are routed to a frontier reasoning model. The approach is presented as the first to perform explicit temporal-logic reasoning over timelines built from real annotations. The results identify temporal localization as the key performance bottleneck and indicate that precise annotations yield larger gains than scaling model size. On the TimeLogic Challenge, the method raises test accuracy from 46.9% to 71.37%, an absolute gain of about 24.5 percentage points, within 3 points of the top leaderboard entry.
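The three-tier dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function `answer`, the callables `run_program`, `ask_vlm`, and `ask_reasoner`, and the contents of `hard_categories` are all hypothetical, and the order in which the tiers are checked is an assumption drawn from the summary.

```python
from typing import Callable, List, Optional, Set, Tuple


def answer(
    question: str,
    category: str,
    timeline: Optional[List],                  # None when no source annotation exists
    run_program: Callable[[str, List], str],   # hypothetical: parse + execute symbolically
    ask_vlm: Callable[[str], str],             # hypothetical: open video-language model
    ask_reasoner: Callable[[str], str],        # hypothetical: frontier reasoning model
    hard_categories: Set[str],                 # categories where the VLM is weakest (assumed input)
) -> Tuple[str, str]:
    """Return (answer, tier_used) for one question."""
    # Tier 1: an annotated timeline exists, so execute the logic program deterministically.
    if timeline is not None:
        return run_program(question, timeline), "symbolic"
    # Tier 3: no annotation, and the category is one the VLM handles poorly.
    if category in hard_categories:
        return ask_reasoner(question), "reasoner"
    # Tier 2: no annotation, ordinary category, so fall back to the VLM.
    return ask_vlm(question), "vlm"
```

Passing the three backends in as callables keeps the routing logic testable without loading any model.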
📝 Abstract
The TimeLogic Challenge evaluates formal temporal-logic reasoning over video, covering 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, an absolute gain of about 24.5 percentage points, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
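To make step (i) concrete, here is a minimal sketch of deterministic execution over an action timeline. It is an illustration under stated assumptions, not the paper's implementation: the timeline representation (action label mapped to start and end times in seconds), the program format `(operator, a, b)`, and the exact semantics chosen for `before`, `after`, and `co-occur` are all hypothetical, and only three of the 16 operators are shown.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]            # (start_sec, end_sec)
Timeline = Dict[str, List[Interval]]      # action label -> list of occurrences


def before(tl: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: some occurrence of a ends no later than some occurrence of b starts."""
    return any(end_a <= start_b
               for (_, end_a) in tl.get(a, [])
               for (start_b, _) in tl.get(b, []))


def after(tl: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: a is after b exactly when b is before a."""
    return before(tl, b, a)


def co_occur(tl: Timeline, a: str, b: str) -> bool:
    """Assumed semantics: some occurrence of a overlaps some occurrence of b."""
    return any(sa < eb and sb < ea
               for (sa, ea) in tl.get(a, [])
               for (sb, eb) in tl.get(b, []))


OPS = {"before": before, "after": after, "co-occur": co_occur}


def execute(program: Tuple[str, str, str], tl: Timeline) -> bool:
    """Run one (operator, a, b) program against a timeline."""
    op, a, b = program
    return OPS[op](tl, a, b)


# Hypothetical timeline for one video.
timeline: Timeline = {
    "open fridge": [(1.0, 3.0)],
    "pour milk": [(4.0, 7.0)],
    "hold cup": [(5.0, 8.0)],
}

print(execute(("before", "open fridge", "pour milk"), timeline))
print(execute(("co-occur", "pour milk", "hold cup"), timeline))
```

Because the answer depends only on interval endpoints, the same question over the same timeline always yields the same result, so the accuracy of this tier is bounded by the timeline rather than by model size.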
Problem

Research questions and friction points this paper is trying to address.

temporal-logic reasoning
video question answering
action localization
video-language models
temporal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-Logic Grounding
Video Question Answering
Action Timeline Reconstruction
Source-Annotation Utilization
Category-Targeted Reasoning