Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inaccurate spatiotemporal localization of target objects in grounded video question answering, this paper proposes a three-stage framework—video reasoning and QA, spatiotemporal grounding, and object tracking—to enhance vision-language alignment in multimodal large language models (MLLMs). The core contribution is the notion of a “trigger moment”: the keyframe in which the target object is most saliently visible, identified via the proposed CORTEX prompt and used as a spatiotemporal anchor to improve localization consistency and robustness. The method combines MLLM-based VQA reasoning, explicit spatiotemporal grounding, and lightweight object tracking, with prompt engineering tuned for precise keyframe identification. Evaluated on the GVQA benchmark, the approach achieves a HOTA score of 0.4968, surpassing the previous year's winning score (0.2704) by 22.6 percentage points and demonstrating the effectiveness of the trigger-moment-guided paradigm.

📝 Abstract
In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task of the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
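The three-stage decomposition described in the abstract can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the stage functions (`video_reasoning_qa`, `spatiotemporal_grounding`, `track`), the `TriggerMoment` container, and all return values are hypothetical stubs standing in for MLLM calls, CORTEX prompting, and a real tracker.

```python
from dataclasses import dataclass

@dataclass
class TriggerMoment:
    """Anchor produced by Stage 2: the frame where the target is most visible."""
    frame_idx: int                       # index of the trigger frame
    bbox: tuple                          # (x, y, w, h) box in that frame

def video_reasoning_qa(question: str, frames: list) -> str:
    # Stage 1 (stub): an MLLM would answer the question and name the
    # target object to be grounded.
    return "cup"  # hypothetical target object

def spatiotemporal_grounding(target: str, frames: list) -> TriggerMoment:
    # Stage 2 (stub): a CORTEX-style prompt would ask the model for the
    # single most visible frame of `target`; here we pick the middle frame.
    return TriggerMoment(frame_idx=len(frames) // 2, bbox=(10, 10, 40, 40))

def track(anchor: TriggerMoment, frames: list) -> dict:
    # Stage 3 (stub): a lightweight tracker would propagate the anchor box
    # forward and backward in time; here we just copy it to every frame.
    return {i: anchor.bbox for i in range(len(frames))}

def gvqa_pipeline(question: str, frames: list):
    """Run the three stages end to end and return (answer, anchor, tracks)."""
    target = video_reasoning_qa(question, frames)
    anchor = spatiotemporal_grounding(target, frames)
    tracks = track(anchor, frames)
    return target, anchor, tracks
```

The point of the structure is that Stages 1 and 3 never need to agree on the whole video: they only need to be consistent with the single trigger-moment anchor, which is what the paper credits for the robustness gain.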
Problem

Research questions and friction points this paper is trying to address.

Pinpointing trigger moments for spatio-temporal grounding in video QA
Enhancing object tracking accuracy in multimodal large language models
Improving visual grounding capabilities for complex video reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing trigger moment for robust temporal anchoring
Decomposing GVQA into three-stage video reasoning pipeline
Using CORTEX prompt to identify most visible object frame