TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video-language models perform near chance on formal temporal-logic question answering because they cannot precisely localize actions in time. This work proposes a three-tier system. First, questions are parsed into executable temporal-logic programs and evaluated deterministically against action timelines reconstructed from the benchmark's source-dataset annotations. Second, where no such annotations exist, the system falls back to a strong open video-language model. Third, only the question categories on which that model is empirically weakest are routed to a frontier reasoning model. The approach is presented as the first to perform explicit temporal-logic reasoning over real-world timelines; its ablations identify temporal localization as the key performance bottleneck and indicate that precise annotations yield greater gains than scaling model size. On the TimeLogic Challenge, the method raises test accuracy from a 46.9% baseline to 71.37%, an absolute gain of about 24.5 percentage points, within 3 points of the top leaderboard entry.
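The tier selection described above can be sketched as a small dispatcher. This is only an illustrative sketch, not the authors' implementation: the names `Question`, `route_question`, `execute_program`, `open_vlm`, `frontier_model`, and `weak_categories` are all hypothetical, and the sketch assumes that the frontier model is consulted only for questions without annotations, which the summary does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Question:
    video_id: str
    text: str
    category: str  # hypothetical operator category, e.g. "before" or "until"


def route_question(
    question: Question,
    timelines: Dict[str, List[tuple]],
    execute_program: Callable[[Question, List[tuple]], str],
    open_vlm: Callable[[Question], str],
    frontier_model: Callable[[Question], str],
    weak_categories: Set[str],
) -> Tuple[str, str]:
    """Return (tier_name, answer) for one question.

    Assumed tier order: symbolic execution when a reconstructed timeline
    exists, otherwise the frontier model for categories where the open VLM
    is known to be weak, otherwise the open VLM.
    """
    timeline = timelines.get(question.video_id)
    if timeline is not None:
        # Tier 1: deterministic execution over the annotated timeline.
        return "symbolic", execute_program(question, timeline)
    if question.category in weak_categories:
        # Tier 3: targeted routing for empirically weak categories.
        return "frontier", frontier_model(question)
    # Tier 2: fallback to the open video-language model.
    return "vlm", open_vlm(question)
```

Passing the three answerers in as callables keeps the routing logic testable without any model or video access.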

📝 Abstract

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video: 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, an absolute gain of about 24.5 percentage points, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
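To make the idea of deterministic execution concrete, here is a minimal sketch of three of the listed operators evaluated over an action timeline. The interval representation, the `(operator, action_a, action_b)` program format, and the exact operator semantics (for example, "before" meaning that some occurrence of the first action ends no later than some occurrence of the second begins) are assumptions made for illustration; the abstract does not define them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    label: str    # action name from the annotation
    start: float  # seconds
    end: float    # seconds


def occurrences(timeline: List[Interval], label: str) -> List[Interval]:
    """All annotated intervals of one action."""
    return [iv for iv in timeline if iv.label == label]


def before(timeline: List[Interval], a: str, b: str) -> bool:
    # Assumed semantics: some occurrence of a ends no later than
    # some occurrence of b starts.
    return any(
        x.end <= y.start
        for x in occurrences(timeline, a)
        for y in occurrences(timeline, b)
    )


def after(timeline: List[Interval], a: str, b: str) -> bool:
    # Assumed semantics: a after b is b before a.
    return before(timeline, b, a)


def co_occur(timeline: List[Interval], a: str, b: str) -> bool:
    # Assumed semantics: some occurrences of a and b overlap in time.
    return any(
        x.start < y.end and y.start < x.end
        for x in occurrences(timeline, a)
        for y in occurrences(timeline, b)
    )


OPERATORS = {"before": before, "after": after, "co-occur": co_occur}


def execute(program: Tuple[str, str, str], timeline: List[Interval]) -> bool:
    """Evaluate a hypothetical (operator, action_a, action_b) program."""
    op, a, b = program
    return OPERATORS[op](timeline, a, b)


timeline = [
    Interval("open door", 0.0, 2.0),
    Interval("pour water", 3.0, 6.0),
    Interval("stir", 5.0, 8.0),
]
print(execute(("before", "open door", "pour water"), timeline))  # True
print(execute(("co-occur", "pour water", "stir"), timeline))     # True
print(execute(("co-occur", "open door", "stir"), timeline))      # False
```

Because each answer is a pure function of the timeline, any error in such a pipeline traces back to the annotations or the question parse rather than to model sampling.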

Problem

Research questions and friction points this paper is trying to address.

temporal-logic reasoning

video question answering

action localization

video-language models

temporal grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-Logic Grounding

Video Question Answering

Action Timeline Reconstruction