InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited temporal understanding and interactive capability of open-source foundation models on long-horizon multimodal tasks. It proposes Multimodal Contextual Reasoning (MCR), a framework that treats video understanding as a closed-loop process over a shared, evolving context of observations, instructions, reasoning, tool actions, and memory. For efficiency, it introduces Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization that compresses key-value (KV) cache states while retaining the full token stream. Training is staged: continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. The model reports strong performance on benchmarks such as Video-MME, MLVU, and EgoSchema and, when instantiated as a video agent with retrieval tools, shows evidence-grounded behavior over long videos.
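To make the efficiency argument concrete, the back-of-the-envelope sketch below compares the memory of a standard per-head KV cache with a cache that stores one compressed latent vector per token. It is only an illustration of the general latent-caching idea: the listing does not give M^2LA's actual layout, and every number here (token count, layers, heads, head size, latent size, 2-byte values) is a hypothetical choice. The token count is the same in both cases, which is what "token-preserving" means in this context: only the per-token state shrinks.

```python
def kv_cache_bytes(num_tokens, num_layers, per_token_values, bytes_per_value=2):
    """Cache size when each token stores `per_token_values` numbers per layer."""
    return num_tokens * num_layers * per_token_values * bytes_per_value


# Hypothetical configuration, chosen only for illustration (not from the paper).
NUM_TOKENS = 100_000   # visual + text tokens kept in context
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
LATENT_DIM = 512       # assumed size of the compressed per-token latent

# Standard attention caches one key and one value vector per head.
full = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, 2 * NUM_HEADS * HEAD_DIM)
# Latent caching stores a single compressed vector per token instead.
latent = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, LATENT_DIM)
ratio = full / latent

print(f"full KV cache: {full / 2**30:.1f} GiB")
print(f"latent cache:  {latent / 2**30:.1f} GiB")
print(f"reduction:     {ratio:.0f}x")
```

Under these assumed sizes the reduction factor is simply `2 * NUM_HEADS * HEAD_DIM / LATENT_DIM`, so it does not depend on how long the video is.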

📝 Abstract

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
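The closed-loop process described in the abstract can be pictured with a minimal sketch. It assumes a toy setting: the "video" is a hand-written list of captioned segments, the retrieval tool is a keyword match, and the reasoning step simply cycles through supplied keywords. The names `Context`, `retrieve`, `run_agent`, and `VIDEO` are hypothetical and are not taken from InternVideo3; a real system would have the model generate the queries and judge the evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Shared, evolving context with the five components named in the abstract."""
    instruction: str
    observations: list = field(default_factory=list)
    reasoning: list = field(default_factory=list)
    tool_actions: list = field(default_factory=list)
    memory: list = field(default_factory=list)


# Hypothetical stand-in for a long video: (start_sec, end_sec, caption).
VIDEO = [
    (0, 60, "a person enters the kitchen"),
    (60, 120, "the person chops onions"),
    (120, 180, "the person puts a pan on the stove"),
]


def retrieve(keyword, seen):
    """Toy retrieval tool: first unseen segment whose caption has the keyword."""
    for segment in VIDEO:
        if segment not in seen and keyword in segment[2]:
            return segment
    return None


def run_agent(instruction, keywords, needed=1, max_steps=5):
    ctx = Context(instruction=instruction)
    for step in range(max_steps):
        # Reason: choose the next query (a real model would generate this).
        keyword = keywords[step % len(keywords)]
        ctx.reasoning.append(f"step {step}: look for '{keyword}'")
        # Act: call the tool and log the action in the shared context.
        ctx.tool_actions.append(("retrieve", keyword))
        segment = retrieve(keyword, ctx.memory)
        # Observe: the tool result is appended as a new observation.
        ctx.observations.append(segment)
        # Accumulate: only segments that supply evidence enter memory.
        if segment is not None:
            ctx.memory.append(segment)
        # Verify: stop once enough evidence has been gathered.
        if len(ctx.memory) >= needed:
            break
    return ctx


ctx = run_agent("What happens before cooking starts?", ["onions", "stove"], needed=2)
for start, end, caption in ctx.memory:
    print(f"{start}-{end}s: {caption}")
```

The point of the design is that every step reads from and writes to the same context, so later queries can depend on what earlier tool calls returned, and the loop ends on a verification condition instead of after a single pass over the video.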

Problem

Research questions and friction points this paper is trying to address.

multimodal

long-horizon

video understanding

temporal reasoning

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Contextual Reasoning

M^2LA

long-horizon video understanding