🤖 AI Summary
To address inaccurate spatiotemporal localization of target objects in video question answering (VQA), this paper proposes a three-stage framework—decomposed reasoning, spatiotemporal localization, and object tracking—to enhance vision-language alignment in multimodal large language models (MLLMs). The core contribution is the introduction of “trigger moments”: keyframes in which the target object is most saliently visible, identified via CORTEX prompting and used as spatiotemporal anchors to improve localization consistency and robustness. The method integrates MLLMs, VQA reasoning, explicit spatiotemporal localization, and lightweight object tracking, with prompt engineering optimized for precise keyframe identification. Evaluated on the GVQA benchmark, the approach achieves a HOTA score of 0.4968—surpassing the previous year’s winning score (0.2704) by 22.6 percentage points—demonstrating the effectiveness of the trigger-moment-guided paradigm.
📝 Abstract
In this technical report, we introduce a framework for the Grounded Video Question Answering (GVQA) task of the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single frame in which the target object is most clearly visible and serves as a robust anchor for grounding and tracking. With this design, our framework achieves a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
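The three-stage decomposition above can be sketched as a simple pipeline. This is a minimal illustrative skeleton, not the authors' implementation: every function, class, and value below (the stub answer, the trigger-moment heuristic, the constant box) is a hypothetical placeholder standing in for the MLLM, the CORTEX-prompted grounding step, and the tracker.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Grounding:
    frame_idx: int  # trigger moment: the frame where the object is most visible
    box: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def video_reasoning_qa(frames: List[str], question: str) -> str:
    """Stage 1 (stub): an MLLM reasons over the video and answers the question."""
    return "the red cup"  # placeholder answer naming the target object

def spatiotemporal_grounding(frames: List[str], answer: str) -> Grounding:
    """Stage 2 (stub): a CORTEX-style prompt would pick the trigger moment
    and localize the target object in that frame."""
    trigger = len(frames) // 2  # placeholder: pretend the middle frame is most salient
    return Grounding(frame_idx=trigger, box=(0.2, 0.3, 0.6, 0.8))

def track(frames: List[str], anchor: Grounding) -> List[Grounding]:
    """Stage 3 (stub): propagate the anchor box forwards and backwards in time."""
    return [Grounding(i, anchor.box) for i in range(len(frames))]

def run_pipeline(frames: List[str], question: str) -> List[Grounding]:
    answer = video_reasoning_qa(frames, question)
    anchor = spatiotemporal_grounding(frames, answer)  # trigger moment as anchor
    return track(frames, anchor)

tracks = run_pipeline([f"frame_{i}.jpg" for i in range(8)],
                      "What did the person pick up?")
```

The key design point this sketch mirrors is that the tracker never sees the question: the trigger-moment grounding fully decouples language reasoning (stage 1) from the purely visual tracking problem (stage 3).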