Watch, Remember, Reason: Human-View Video Understanding with MLLMs

📅 2026-06-05

🤖 AI Summary

This work surveys long-form video understanding with multimodal large language models (MLLMs), where the key challenges are sparse evidence, long-range dependencies, multimodal alignment, and limited computational budgets. It proposes a human-view perspective organized around three functional abilities: watching, remembering, and reasoning. A unifying formulation characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Representative methods are grouped by role: watching covers fine-grained, comprehensive, audio-visual, and efficient perception; remembering covers offline and streaming memory; reasoning covers text-only reasoning and thinking with videos. The authors also review five application domains (egocentric, sports, instructional, medical, and narrative videos) together with existing training datasets and evaluation benchmarks, and they outline open problems for scalable, memory-aware, and evidence-grounded video intelligence.

📝 Abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
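
The watching, remembering, and reasoning decomposition described in the abstract can be pictured with a toy sketch. This is not the paper's formulation: every name below (Evidence, VideoState, watch, remember, reason, run) is hypothetical, frames are stand-in (timestamp, caption) pairs rather than real video, and the keyword match only stands in for model inference. It simply shows one state object carrying the four components the abstract names: perceptual representations, memory state, reasoning trace, and final prediction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Evidence:
    """One perceived item: a timestamp plus a text description (hypothetical)."""
    timestamp: float
    description: str


@dataclass
class VideoState:
    """Illustrative container for the four components named in the abstract."""
    perception: List[Evidence] = field(default_factory=list)
    memory: List[Evidence] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)
    prediction: str = ""


def watch(frames: List[Tuple[float, str]], stride: int = 2) -> List[Evidence]:
    """Watching: keep every stride-th frame as perceptual evidence."""
    return [Evidence(t, caption) for t, caption in frames[::stride]]


def remember(memory: List[Evidence], new: List[Evidence],
             capacity: int = 3) -> List[Evidence]:
    """Remembering: a bounded, streaming-style memory of the most recent items."""
    return (memory + new)[-capacity:]


def reason(memory: List[Evidence], query: str) -> Tuple[List[str], str]:
    """Reasoning: a toy keyword match that cites the memory items it relied on."""
    trace = [f"t={e.timestamp}: {e.description}"
             for e in memory if query in e.description]
    prediction = "yes" if trace else "unknown"
    return trace, prediction


def run(frames: List[Tuple[float, str]], query: str) -> VideoState:
    """Chain the three stages and return the full state, not only the answer."""
    state = VideoState()
    state.perception = watch(frames)
    state.memory = remember(state.memory, state.perception)
    state.trace, state.prediction = reason(state.memory, query)
    return state


frames = [
    (0.0, "a person opens a door"),
    (1.0, "a dog runs"),
    (2.0, "a person pours coffee"),
    (3.0, "empty room"),
    (4.0, "a cat sleeps"),
    (5.0, "empty room"),
    (6.0, "a person leaves"),
]
coffee = run(frames, "coffee")
door = run(frames, "door")
```

In this sketch the "coffee" query is answered with a cited timestamp, while the "door" query returns "unknown" because the only supporting frame was evicted from the bounded memory. That small failure mirrors the sparse-evidence and long-range-dependency challenges the abstract lists.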

Problem

Research questions and friction points this paper is trying to address.

video understanding

multimodal large language models

long-range dependencies

multimodal alignment

sparse evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-view video understanding

multimodal large language models

memory modeling