MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

📅 2026-05-20

🤖 AI Summary
This work addresses the scarcity of high-quality structured annotations for video event reasoning, which must cover temporal, spatial, causal, and consequential dimensions, by proposing MAVEN, a multi-stage, agent-driven annotation pipeline. The pipeline synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three caption levels and uses it as the sole input for generating Q&A data with chain-of-thought reasoning traces. An agent redesigns the prompts for new domains without manual re-engineering, and a hierarchical refinement loop traces annotation errors to the originating pipeline stage and applies targeted edits. The authors label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the result. On a private CCTV test set, the fine-tuned model surpasses both Gemini 2.5 Pro and 3.1 Flash and gains 38.8 points in multiple-choice accuracy over zero-shot. On AccidentBench, CCTV-only training yields a 10.7-point MCQ gain and matches Gemini 2.5 Pro; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines.
📝 Abstract
Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.
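The hierarchical refinement loop described in the abstract (classify errors against a taxonomy, trace them to the originating stage, apply a targeted edit) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the stage names, the error labels, the `ERROR_TO_STAGE` mapping, and the "edit" itself, which just appends a revision marker to a prompt string. The actual MAVEN taxonomy, prompts, and agent logic are not given in the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical stage names. The abstract only states that three caption levels
# feed an MSTED synthesis step, which in turn feeds Q&A generation.
STAGES = ["caption_levels", "msted_synthesis", "qa_generation"]

# Hypothetical error taxonomy: each error class maps to the stage it is traced to.
ERROR_TO_STAGE = {
    "missed_event": "caption_levels",
    "wrong_timestamp": "msted_synthesis",
    "ambiguous_question": "qa_generation",
}


@dataclass
class Pipeline:
    """Holds one prompt and one revision counter per stage (illustrative only)."""
    prompts: dict = field(
        default_factory=lambda: {s: f"base prompt for {s}" for s in STAGES}
    )
    revisions: dict = field(default_factory=lambda: {s: 0 for s in STAGES})


def trace_root_causes(error_labels):
    """Count classified errors per originating stage, ignoring unknown labels."""
    counts = Counter()
    for label in error_labels:
        stage = ERROR_TO_STAGE.get(label)
        if stage is not None:
            counts[stage] += 1
    return counts


def refine(pipeline, error_labels):
    """Apply one targeted edit to the stage that accounts for the most errors.

    Returns the edited stage name, or None if no error could be traced.
    """
    counts = trace_root_causes(error_labels)
    if not counts:
        return None
    stage, _ = counts.most_common(1)[0]
    pipeline.revisions[stage] += 1
    # Stand-in for a real prompt rewrite: tag the prompt with its revision number.
    pipeline.prompts[stage] += f" [revision {pipeline.revisions[stage]}]"
    return stage


pipeline = Pipeline()
edited = refine(pipeline, ["wrong_timestamp", "missed_event", "wrong_timestamp"])
print(edited, "->", pipeline.prompts[edited])
```

In this toy version only the single worst stage is edited per iteration, so repeated calls to `refine` on fresh error labels mimic the iterative loop; a real system would also need the classifier that produces the error labels and, per the abstract, the option to modify the pipeline structure rather than only the prompts.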
Problem

Research questions and friction points this paper is trying to address.

video reasoning
structured annotation
vision language models
event understanding
data labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic annotation
multi-stage pipeline
Chain-of-Thought reasoning
domain adaptation
video event reasoning