MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

2026-06-05
AI Summary
Existing vision-language models struggle to model global semantics in hours-long videos because of token explosion and attention dilution. This work proposes an agent framework that decouples perception from reasoning: it streams the video incrementally to build a hierarchical graph memory, then answers queries through tool-augmented retrieval driven by an observe-reason-act loop. The approach enables efficient logical navigation over long-form video content and is described as the first scalable solution for long-video semantic understanding. It achieves state-of-the-art performance across four mainstream benchmarks, delivering a 12.5-point absolute accuracy gain while limiting the reasoning context to only 2% of full-context ingestion, and narrows the performance gap with human experts to just 3.7 points.
Abstract
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
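To make the retrieval idea concrete, here is a minimal Python sketch of a three-tier graph memory with the three tool types the abstract names (navigating hierarchies, searching nodes, traversing edges) and an Observation-Reason-Action loop over them. It is an illustration under stated assumptions, not MemDreamer's implementation: the class and function names (`Node`, `HierarchicalGraphMemory`, `run_agent`, `scripted_policy`), the contents of each tier, the `caused_by` relation label, and the toy video events are all hypothetical, and a scripted policy stands in for the reasoning model.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int  # assumed tiers: 0 = video summary, 1 = scene, 2 = foundational event
    text: str
    children: list[str] = field(default_factory=list)
    edges: dict[str, list[str]] = field(default_factory=dict)  # relation -> target ids


class HierarchicalGraphMemory:
    """Toy memory: a tree of summaries whose leaves are linked by relation edges."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node, parent: str | None = None) -> None:
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def link(self, src: str, relation: str, dst: str) -> None:
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    # Tools exposed to the reasoning model.
    def expand(self, node_id: str) -> list[str]:
        """Navigate the hierarchy: return the ids of a node's children."""
        return list(self.nodes[node_id].children)

    def search(self, keyword: str) -> list[str]:
        """Search nodes: return ids whose text contains the keyword."""
        return [n.node_id for n in self.nodes.values()
                if keyword.lower() in n.text.lower()]

    def traverse(self, node_id: str, relation: str) -> list[str]:
        """Traverse logical edges: follow one relation from a node."""
        return list(self.nodes[node_id].edges.get(relation, []))


def run_agent(memory, policy, question, max_steps=8):
    """Observation-Reason-Action loop; only tool outputs enter the context."""
    tools = {"expand": memory.expand, "search": memory.search,
             "traverse": memory.traverse}
    observation = memory.expand("root")  # initial observation
    trace = []
    for _ in range(max_steps):
        name, *args = policy(question, observation, trace)  # reason
        if name == "answer":
            return args[0], trace
        observation = tools[name](*args)  # act, then observe the result
        trace.append((name, tuple(args), observation))
    return None, trace


def scripted_policy(question, observation, trace):
    """Stand-in for the reasoning model: a fixed three-step plan."""
    if not trace:
        return ("search", "spill")
    if len(trace) == 1:
        return ("traverse", observation[0], "caused_by")
    return ("answer", observation[0])


# Build a tiny memory for a hypothetical cooking video.
memory = HierarchicalGraphMemory()
memory.add(Node("root", 0, "cooking video"))
memory.add(Node("scene1", 1, "kitchen prep"), parent="root")
memory.add(Node("e1", 2, "pot boils over"), parent="scene1")
memory.add(Node("e2", 2, "milk spill on the floor"), parent="scene1")
memory.link("e2", "caused_by", "e1")

answer, trace = run_agent(memory, scripted_policy, "What caused the spill?")
print(memory.nodes[answer].text, "| tool calls:", len(trace))
```

The point of the design is that the reasoning step only ever sees short tool outputs (lists of node ids and their text), never the full video token sequence, which is how a loop like this can keep the reasoning context small relative to full-context ingestion.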
Problem

Research questions and friction points this paper is trying to address.

long video understanding
vision-language models
token explosion
attention dilution
multimodal comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Graph Memory
Agentic Retrieval
Perception-Reasoning Decoupling
Long Video Understanding
Tool-Augmented Reasoning