Towards One-to-Many Temporal Grounding

📅 2026-06-04

🤖 AI Summary
Existing video temporal grounding methods are typically confined to single-segment retrieval and struggle in real-world scenarios where one text query corresponds to multiple non-contiguous video segments. This work formalizes that setting as the One-to-Many Temporal Grounding (OMTG) task, establishes the first OMTG benchmark, and proposes two evaluation metrics: Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1). It also curates a 56k-sample OMTG dataset and designs temporal and caption reward functions for training multimodal large language models with reinforcement learning; the caption reward applies chain-of-thought reasoning over dense video captions so that the policy is optimized for both localization precision and completeness. On OMTG Bench, the proposed model achieves an EtF1 of 43.65%, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
📝 Abstract
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores because they lack a perception of event cardinality. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
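To make the one-to-many setting concrete, the sketch below scores a set of predicted segments against a set of ground-truth segments. It is only an illustration under stated assumptions: the abstract does not define C-Acc or EtF1, so the functions here (`temporal_iou`, `count_match`, `segment_f1`), the greedy matching, and the 0.5 IoU threshold are hypothetical stand-ins for a generic count check and a generic segment-level F1, not the paper's metrics.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed count check: did the model predict the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Assumed segment-level F1 using greedy one-to-one matching by IoU."""
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    true_positives = 0
    for p in pred:
        if not unmatched:
            break
        # Pick the still-unmatched ground-truth segment that overlaps p the most.
        best = max(unmatched, key=lambda g: temporal_iou(p, g))
        if temporal_iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # penalizes spurious segments
    recall = true_positives / len(gt)       # penalizes missed segments
    return 2 * precision * recall / (precision + recall)


# A query with two disjoint ground-truth segments, but only one is predicted.
gt_segments = [(0.0, 10.0), (20.0, 30.0)]
single_prediction = [(1.0, 10.0)]
print(count_match(single_prediction, gt_segments))
print(round(segment_f1(single_prediction, gt_segments), 3))
```

In this example a one-to-one style prediction localizes its single segment well yet misses the second occurrence, so recall drops and the count check fails. This is the kind of incompleteness that the one-to-many setting is meant to expose.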
Problem

Research questions and friction points this paper is trying to address.

Temporal Grounding
One-to-Many
Video Segmentation
Event Cardinality
Multimodal Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-to-Many Temporal Grounding
Temporal Grounding
Chain-of-Thought Reasoning
Reward Function Design
Video-Language Alignment