MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle to model global semantics in hour-long videos because of token explosion and attention dilution. This work proposes MemDreamer, an agent framework that decouples perception from reasoning: it streams the video incrementally to build a hierarchical graph memory, then answers questions through tool-augmented retrieval driven by an observe-reason-act loop. The reasoning model therefore navigates a structured memory instead of ingesting the full video. The paper reports state-of-the-art results on four mainstream benchmarks and a 12.5-point absolute accuracy gain while restricting the reasoning context to about 2% of full-context ingestion, which narrows the gap with human experts to 3.7 points.
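To make the idea of an incrementally built hierarchical graph memory more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `HierarchicalGraphMemory`, the tier names (clip, scene, video), the fixed-size grouping rule, and the relation labels are all assumptions made for illustration, and the captions are invented toy strings standing in for the output of a perception model.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int  # hypothetical tiers: 0 = clip, 1 = scene, 2 = whole video
    text: str
    children: list[str] = field(default_factory=list)


class HierarchicalGraphMemory:
    """Toy memory: a base graph of clips plus two summary tiers above it."""

    def __init__(self, clips_per_scene: int = 2) -> None:
        self.nodes: dict[str, Node] = {}
        self.edges: list[tuple[str, str, str]] = []  # (src, relation, dst)
        self.clips_per_scene = clips_per_scene
        self._clip_ids: list[str] = []

    def add_clip(self, caption: str) -> str:
        """Stream in one clip and link it temporally to the previous clip."""
        clip_id = f"clip{len(self._clip_ids)}"
        self.nodes[clip_id] = Node(clip_id, 0, caption)
        if self._clip_ids:
            self.edges.append((self._clip_ids[-1], "before", clip_id))
        self._clip_ids.append(clip_id)
        return clip_id

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        """Record a non-temporal edge, for example a causal one."""
        self.edges.append((src, relation, dst))

    def build_upper_tiers(self) -> Node:
        """Group clips into scenes, then attach all scenes to one root node."""
        scene_ids: list[str] = []
        step = self.clips_per_scene
        for start in range(0, len(self._clip_ids), step):
            group = self._clip_ids[start:start + step]
            scene_id = f"scene{len(scene_ids)}"
            summary = " / ".join(self.nodes[c].text for c in group)
            self.nodes[scene_id] = Node(scene_id, 1, summary, list(group))
            scene_ids.append(scene_id)
        root = Node("video", 2, f"{len(scene_ids)} scenes", scene_ids)
        self.nodes["video"] = root
        return root


# Toy usage with invented captions.
mem = HierarchicalGraphMemory(clips_per_scene=2)
for caption in ["chef chops onions", "chef heats the pan", "chef plates the dish"]:
    mem.add_clip(caption)
mem.add_relation("clip1", "causes", "clip2")
root = mem.build_upper_tiers()
print(root.children)
```

In this sketch the summaries are simple string joins; a real system would presumably generate them with a model, and the grouping rule would depend on scene boundaries rather than a fixed clip count.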

📝 Abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logical reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
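The retrieval procedure described above (navigating hierarchies, searching nodes, traversing edges inside an Observation-Reason-Action loop) can be sketched roughly as follows. This is an illustrative toy, not the paper's code: the tool names `expand`, `search`, and `neighbors`, the memory contents, and the `scripted_policy` function are all hypothetical, and the scripted policy merely stands in for the reasoning model that would choose actions in a real system.

```python
from typing import Any, Callable

# Hypothetical toy memory; in practice this would be built from the video.
HIERARCHY = {
    "video": ["scene0", "scene1"],
    "scene0": ["clip0", "clip1"],
    "scene1": ["clip2"],
}
TEXT = {
    "video": "cooking video",
    "scene0": "preparation",
    "scene1": "serving",
    "clip0": "chef chops onions",
    "clip1": "chef heats the pan",
    "clip2": "chef plates the dish",
}
EDGES = [("clip0", "before", "clip1"), ("clip1", "causes", "clip2")]


def expand(node_id: str) -> list[str]:
    """Navigate the hierarchy: return the children of a node."""
    return HIERARCHY.get(node_id, [])


def search(keyword: str) -> list[str]:
    """Search nodes: return ids whose text contains the keyword."""
    return sorted(n for n, t in TEXT.items() if keyword.lower() in t.lower())


def neighbors(node_id: str) -> list[tuple[str, str]]:
    """Traverse edges: return (relation, destination) pairs leaving a node."""
    return [(rel, dst) for src, rel, dst in EDGES if src == node_id]


TOOLS: dict[str, Callable[[str], Any]] = {
    "expand": expand,
    "search": search,
    "neighbors": neighbors,
}


def run_agent(policy: Callable, max_steps: int = 8):
    """Observe, let the policy reason about the next action, then act."""
    observation: Any = expand("video")  # initial observation: top-level scenes
    trace: list[tuple[str, str, Any]] = []
    for _ in range(max_steps):
        action, arg = policy(observation, trace)  # reason
        if action == "answer":
            return arg, trace
        observation = TOOLS[action](arg)          # act, then observe the result
        trace.append((action, arg, observation))
    return None, trace


def scripted_policy(observation: Any, trace: list):
    """Stand-in for the reasoning model: answers 'what did heating the pan lead to?'"""
    if not trace:
        return "search", "pan"
    last_action, _, last_obs = trace[-1]
    if last_action == "search":
        return "neighbors", last_obs[0]
    if last_action == "neighbors":
        caused = [dst for rel, dst in last_obs if rel == "causes"]
        return "answer", TEXT[caused[0]]
    return "answer", None


answer, trace = run_agent(scripted_policy)
print(answer, len(trace))
```

Only the nodes returned by tool calls enter the policy's context, which is the intuition behind keeping the reasoning context small relative to ingesting the whole video.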

Problem

Research questions and friction points this paper is trying to address.

long video understanding

vision-language models

token explosion

attention dilution

multimodal comprehension

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Graph Memory

Agentic Retrieval

Perception-Reasoning Decoupling