SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video understanding benchmarks cannot jointly evaluate causal reasoning and strategic planning in realistic multi-agent scenarios: real-world videos lack verifiable ground truth for such questions, while synthetic environments are overly simplified. This work defines a “Strategic Video Intelligence” (SVI) capability stack and introduces SVI-Bench, a large-scale dynamic microworld benchmark grounded in team sports (basketball, soccer, and hockey), which combines real-world complexity with rule-based verifiability. By fusing broadcast footage, expert commentary, action annotations, game reports, and structured statistics, the authors construct a densely annotated, cross-referenced corpus and design nine evaluation tasks organized into a progressive four-pillar hierarchy. Experiments reveal that models perform competently on perceptual tasks, reaching approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level; on agentic tasks that require autonomously gathering and integrating evidence, the strongest model achieves only 5% accuracy. This capability cliff underscores the need for such a benchmark when evaluating advanced video understanding.

📝 Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.
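
To make the four-pillar evaluation concrete, here is a minimal sketch of how per-question outcomes could be rolled up into one accuracy per pillar, so that a drop across the hierarchy becomes visible. The pillar names are taken from the abstract; the function `accuracy_by_pillar`, the `(pillar, is_correct)` record layout, and the toy outcomes are assumptions made for illustration and are not the authors' evaluation code or reported results.

```python
from collections import defaultdict

# Pillar names come from the abstract; list order encodes the hierarchy.
PILLARS = [
    "Dynamic Scene Understanding",
    "Causal Reasoning",
    "Strategic Simulation",
    "Agentic Synthesis",
]


def accuracy_by_pillar(records):
    """Return {pillar: fraction correct} for (pillar, is_correct) records.

    Hypothetical helper: the record layout is an assumption, not SVI-Bench's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pillar, is_correct in records:
        if pillar not in PILLARS:
            raise ValueError(f"unknown pillar: {pillar}")
        total[pillar] += 1
        correct[pillar] += int(is_correct)
    # Report in hierarchy order, skipping pillars that have no records.
    return {p: correct[p] / total[p] for p in PILLARS if total[p]}


# Toy, made-up outcomes (NOT results from the paper), used only to run the function.
toy_records = [
    ("Dynamic Scene Understanding", True),
    ("Dynamic Scene Understanding", True),
    ("Dynamic Scene Understanding", False),
    ("Dynamic Scene Understanding", True),
    ("Causal Reasoning", True),
    ("Causal Reasoning", False),
    ("Agentic Synthesis", False),
    ("Agentic Synthesis", False),
]

scores = accuracy_by_pillar(toy_records)
for pillar, acc in scores.items():
    print(f"{pillar}: {acc:.0%}")
```

Keeping the pillars in a fixed ordered list means the report always follows the hierarchy from perception to agentic synthesis, which is the order in which the abstract describes performance degrading.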

Problem

Research questions and friction points this paper is trying to address.

Strategic Video Intelligence

causal reasoning

multi-agent systems

video benchmark

strategic planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Strategic Video Intelligence

Dynamic Microworld

Multi-agent Interaction