LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty multimodal large language models have in retaining and retrieving spatial layouts, routes, viewpoint changes, and object states across long videos. The authors introduce LongSpace-Bench, a room-tour video benchmark covering scene perception, spatial relations, and spatial memory, and propose LongSpace, a memory framework that models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and builds layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, which the authors take as evidence that explicit spatial memory is a key capability for long-horizon video MLLMs.

📝 Abstract

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
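
The abstract describes modeling a long video as sequential chunks and retrieving from memory under question guidance. The sketch below is only a toy illustration of that general pattern, not the authors' implementation: the names `split_into_chunks`, `ChunkMemory`, `write`, and `retrieve` are hypothetical, the embeddings are hand-written stand-ins for real features, cosine similarity is an assumed scoring rule, and the 3D structural cues and layer-aware memory construction are omitted entirely.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(frames[i:i + chunk_size])
            for i in range(0, len(frames), chunk_size)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ChunkMemory:
    """Hypothetical memory holding one embedding per video chunk."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, Vector]] = []

    def write(self, chunk_index: int, embedding: Vector) -> None:
        """Append the embedding of a processed chunk to memory."""
        self.entries.append((chunk_index, embedding))

    def retrieve(self, query: Vector, top_k: int = 2) -> List[int]:
        """Return indices of the top_k chunks most similar to the query,
        sorted in temporal order."""
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[1]),
                        reverse=True)
        return sorted(index for index, _ in ranked[:top_k])


# Toy usage: 7 frames, 3 frames per chunk, made-up 2-D chunk embeddings.
chunks = split_into_chunks(list(range(7)), chunk_size=3)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # assumed values

memory = ChunkMemory()
for index, embedding in enumerate(toy_embeddings):
    memory.write(index, embedding)

question_embedding = [1.0, 0.0]  # assumed embedding of the question
relevant = memory.retrieve(question_embedding, top_k=2)
```

In this toy setup the query is closest to the first and third chunk embeddings, so only those chunks would be passed on for answering; a real system would derive both the chunk and question embeddings from the model itself.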

Problem

Research questions and friction points this paper is trying to address.

long-horizon spatial memory

video understanding

spatial reasoning

multimodal large language models

spatial recall

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial memory

long-horizon reasoning

3D structural cues