Temporal-Aware Reasoning Optimization for Video Temporal Grounding

📅 2026-06-08
🤖 AI Summary
This work addresses the limitations of existing multimodal large language model–based approaches to video temporal grounding, which suffer from inefficient random exploration and reward mechanisms that prioritize answer correctness over reasoning quality, thereby hindering precise localization of target segments. To overcome these issues, the authors propose TaRO, a novel framework that constructs dense caption– and timestamp–guided reasoning paths and introduces a time-sensitive reward function that, for the first time, explicitly evaluates reasoning quality based on the model’s reliance on event boundaries. Integrated with a progressive curriculum learning strategy, TaRO enables the model to transition smoothly from guided reasoning to autonomous, efficient exploration. Experimental results demonstrate that TaRO achieves state-of-the-art performance across multiple video temporal grounding benchmarks, significantly improving both localization accuracy and reasoning fidelity.
📝 Abstract
Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding (VTG) by using reinforcement learning to generate reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability to think with time. First, we introduce Constructive Reasoning Exploration, which leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps, so if the event boundary that the reasoning relies on is disrupted, the reasoning should become invalid and the logit of the reasoning path should drop. We use this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum: it starts by using this reward to select the better constructed reasoning paths, then evolves to a free exploration phase in which the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.
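The idea behind the Temporal-Sensitivity Reward and the first curriculum stage can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formula, so everything below is an assumption made for illustration. In particular, the sketch scores a reasoning path by its length-normalised token log-probability rather than raw logits, clips negative drops to zero, and uses hypothetical function names (`path_score`, `temporal_sensitivity_reward`, `select_best_path`). The per-token log-probabilities are assumed to come from scoring the same reasoning text twice, once with the intact video and once with the relevant event boundary disrupted.

```python
from typing import List, Sequence, Tuple


def path_score(token_log_probs: Sequence[float]) -> float:
    """Length-normalised log-probability of one reasoning path (assumed scoring rule)."""
    if not token_log_probs:
        raise ValueError("a reasoning path needs at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def temporal_sensitivity_reward(
    intact_log_probs: Sequence[float],
    disrupted_log_probs: Sequence[float],
) -> float:
    """Reward the drop in path score when the event boundary is disrupted.

    A large drop suggests the reasoning depended on that boundary;
    a drop near zero suggests the reasoning was not anchored in time.
    Clipping at zero is an illustrative choice, not taken from the paper.
    """
    drop = path_score(intact_log_probs) - path_score(disrupted_log_probs)
    return max(0.0, drop)


# One candidate: (label, log-probs with intact video, log-probs with disrupted video).
Candidate = Tuple[str, Sequence[float], Sequence[float]]


def select_best_path(candidates: List[Candidate]) -> str:
    """Pick the constructed reasoning path with the highest reward (curriculum stage one)."""
    if not candidates:
        raise ValueError("no candidate reasoning paths given")
    best = max(candidates, key=lambda c: temporal_sensitivity_reward(c[1], c[2]))
    return best[0]


if __name__ == "__main__":
    # Toy numbers, invented purely to exercise the functions.
    candidates: List[Candidate] = [
        ("time_anchored", [-0.2, -0.4], [-1.2, -1.4]),
        ("generic", [-0.5, -0.5], [-0.5, -0.6]),
    ]
    for label, intact, disrupted in candidates:
        print(label, round(temporal_sensitivity_reward(intact, disrupted), 3))
    print("selected:", select_best_path(candidates))
```

In this toy setup the time-anchored path loses far more score under disruption than the generic one, so it receives the larger reward and is the one selected. In the free exploration phase described in the abstract, the same kind of reward would presumably be applied to reasoning the model generates itself rather than to pre-constructed paths.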
Problem

Research questions and friction points this paper is trying to address.

video temporal grounding
reasoning quality
reinforcement learning
temporal localization
multi-modal large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-Aware Reasoning
Constructive Reasoning Exploration
Temporal-Sensitivity Reward
Video Temporal Grounding
Multi-modal Large Language Models