🤖 AI Summary
This work addresses complex temporal logical reasoning in video question answering, such as event existence, ordering, persistence, boundary conditions, and overlap, by proposing a visual evidence routing framework that decouples perception from symbolic temporal reasoning. The approach first parses each question into event targets, answer mode, candidate options, and temporal operators, then routes the video to a processing strategy based on its duration and operator difficulty. A multimodal large language model generates structured visual evidence, which programmatic verifiers and a deterministic temporal reducer then process with operator-specific temporal rules. A conservative answer fusion step accepts an answer only when the visual evidence, temporal program, and confidence checks agree, which reduces noisy answer flips. On the official TimeLogicQA test evaluation, the final system achieves an AvgAcc of 81.8.
📝 Abstract
TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
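The reducer and fusion steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Interval` type, the operator names (`exists`, `before`, `overlap`), the `fallback` answer, and the `min_conf` threshold are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical event interval in seconds, as recovered by a verifier."""
    start: float
    end: float


def reduce_temporal(op: str, a: Optional[Interval],
                    b: Optional[Interval] = None) -> Optional[bool]:
    """Apply one deterministic rule per operator.

    A missing interval (None) means the event was not detected.
    Returns None when the rule cannot be decided from the evidence.
    The operator set here is an assumption, not the paper's list.
    """
    if op == "exists":
        return a is not None
    if a is None or b is None:
        return None  # undecidable without both intervals
    if op == "before":
        # a ends no later than b starts
        return a.end <= b.start
    if op == "overlap":
        # the two intervals share a span of positive length
        return a.start < b.end and b.start < a.end
    raise ValueError(f"unknown operator: {op}")


def conservative_fuse(visual_answer: bool,
                      program_answer: Optional[bool],
                      confidence: float,
                      fallback: bool,
                      min_conf: float = 0.7) -> bool:
    """Accept the new answer only if all three checks agree.

    Otherwise keep the fallback answer, which avoids flipping on
    noisy evidence. The threshold value is an assumed placeholder.
    """
    agrees = program_answer is not None and visual_answer == program_answer
    if agrees and confidence >= min_conf:
        return program_answer
    return fallback


# Example: "Does event A finish before event B starts?"
a = Interval(0.0, 5.0)
b = Interval(5.0, 9.0)
program = reduce_temporal("before", a, b)
answer = conservative_fuse(visual_answer=True, program_answer=program,
                           confidence=0.9, fallback=False)
```

In this sketch the reducer is a pure function of the intervals, so the same evidence always yields the same answer, and the fusion step changes the fallback answer only when the model's visual answer, the program result, and the confidence check all line up.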