MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of long-term memory in existing video world models, particularly their deficiencies in entity persistence, environmental consistency, and causal coherence. The study introduces what it describes as the first comprehensive benchmark that formally defines memory capability and decomposes it into a three-tiered framework encompassing twelve sub-dimensions. Built upon real-world long-form video data, the evaluation combines rule-based quantitative metrics with vision-language models (VLMs) to assess consistency from multiple angles. Systematic assessment of state-of-the-art models reveals pervasive shortcomings in maintaining coherent long-term states, establishing a standardized benchmark and outlining clear directions for future research.

📝 Abstract

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated, real-world long videos and is evaluated with rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
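As a rough illustration of what a rule-based consistency metric and a three-dimension roll-up could look like, here is a minimal Python sketch. It is not MBench's actual scoring code: the abstract does not specify the metrics, the sub-dimension names, or the aggregation rule. The functions `consistency_score` and `aggregate`, the use of cosine similarity over entity embeddings, and the unweighted averaging are all assumptions made purely for illustration.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(reference, later):
    """Hypothetical rule-based metric: mean cosine similarity between an
    entity's reference embedding and its embeddings at later reappearances."""
    if not later:
        raise ValueError("need at least one later embedding")
    return mean([cosine(reference, emb) for emb in later])


def aggregate(sub_scores):
    """Hypothetical roll-up: average sub-dimension scores within each
    dimension, then take the unweighted mean across dimensions."""
    per_dimension = {dim: mean(scores) for dim, scores in sub_scores.items()}
    overall = mean(list(per_dimension.values()))
    return per_dimension, overall


if __name__ == "__main__":
    # Toy embeddings: the entity matches at one reappearance and drifts at another.
    print(consistency_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
    # Toy sub-dimension scores keyed by the three dimensions named in the abstract.
    print(aggregate({
        "entity": [0.5, 1.0],
        "environment": [0.5],
        "causal": [0.0],
    }))
```

The unweighted mean is only one possible design choice; a real benchmark might weight dimensions differently or report them separately without a single overall score.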

Problem

Research questions and friction points this paper is trying to address.

memory capability

video world models

long-term consistency

benchmark

internal state

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory benchmark

video world models

long-term consistency