🤖 AI Summary
To address inaccurate spatiotemporal localization of target objects in video question answering (VQA), this paper proposes a three-stage framework—decomposed reasoning, spatiotemporal localization, and object tracking—to enhance vision-language alignment in multimodal large language models (MLLMs). The core contribution is the introduction of “trigger moments”: keyframes in which the target object is most saliently visible, identified via CORTEX prompting and used as spatiotemporal anchors to improve localization consistency and robustness. The method integrates MLLMs, VQA reasoning, explicit spatiotemporal localization, and lightweight object tracking, with prompt engineering optimized for precise keyframe identification. Evaluated on the GVQA benchmark, the approach achieves a HOTA score of 0.4968—surpassing the previous year’s winning score (0.2704) by 22.6 percentage points—demonstrating the effectiveness of the trigger-moment-guided paradigm.
📝 Abstract
In this technical report, we introduce a framework for the Grounded Video Question Answering (GVQA) task of the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single frame in which the target object is most clearly visible and serves as a robust anchor for grounding and tracking. With this design, our framework achieves a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
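The three-stage decomposition above can be sketched as a simple pipeline. This is a minimal illustrative skeleton, not the authors' implementation: every function, class, and value below (the stub answer, the trigger-moment heuristic, the constant box) is a hypothetical placeholder standing in for the MLLM, the CORTEX-prompted grounding step, and the tracker.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Grounding:
    frame_idx: int  # trigger moment: the frame where the object is most visible
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def video_reasoning_qa(frames: List[str], question: str) -> str:
    """Stage 1 (stub): an MLLM reasons over the video and answers the question."""
    return "the red cup"  # placeholder answer naming the target object

def spatiotemporal_grounding(frames: List[str], answer: str) -> Grounding:
    """Stage 2 (stub): a CORTEX-style prompt would pick the trigger moment
    and localize the target object in that frame."""
    trigger = len(frames) // 2  # placeholder: pretend the middle frame is most salient
    return Grounding(frame_idx=trigger, box=(0.2, 0.3, 0.6, 0.8))

def track(frames: List[str], anchor: Grounding) -> List[Grounding]:
    """Stage 3 (stub): propagate the anchor box forwards and backwards in time."""
    return [Grounding(i, anchor.box) for i in range(len(frames))]

def run_pipeline(frames: List[str], question: str) -> List[Grounding]:
    answer = video_reasoning_qa(frames, question)
    anchor = spatiotemporal_grounding(frames, answer)  # trigger moment as anchor
    return track(frames, anchor)

tracks = run_pipeline([f"frame_{i}.jpg" for i in range(8)],
                      "What did the person pick up?")
```

The key design point this sketch mirrors is that the tracker never sees the question: the trigger-moment grounding fully decouples language reasoning (stage 1) from the purely visual tracking problem (stage 3).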