AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in modeling 4D (3D + temporal) structures and performing dynamic reasoning, and existing benchmarks and methods lack support for region-level fine-grained understanding. To address this, we propose the first MLLM framework for region-level 4D understanding: (1) we introduce R4D-Bench, the first region-level 4D video question answering benchmark; (2) we propose Perceptual 4D Distillation (P4D), which enables efficient transfer of 4D representations from frozen expert models to lightweight MLLMs; and (3) we design 4D-RGPT, a novel 4D-perception-oriented architecture integrating joint video-point cloud representations with region-level visual prompting. Our method achieves substantial improvements over state-of-the-art approaches across multiple 4D VQA benchmarks and attains leading performance on R4D-Bench. This work provides the first systematic validation of the effectiveness and scalability of region-level 4D perception and reasoning.
Abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench.
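As described, P4D amounts to matching a student model's intermediate features to those of a frozen 4D expert. A minimal sketch of such a feature-distillation loss, assuming a learned linear projection aligns the student and teacher feature dimensions (all names, shapes, and the MSE objective are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def p4d_distill_loss(student_feats, teacher_feats, proj):
    """Mean-squared error between projected student features and the
    frozen expert's 4D features (the teacher receives no gradient)."""
    projected = student_feats @ proj  # map student dim -> teacher dim
    return float(np.mean((projected - teacher_feats) ** 2))

# Toy shapes: 2 tokens, student dim 4, teacher dim 3.
student = np.ones((2, 4))
teacher = np.zeros((2, 3))
proj = np.zeros((4, 3))       # projects everything to zero here
loss = p4d_distill_loss(student, teacher, proj)  # -> 0.0
```

In practice such a loss would be added to the usual language-modeling objective during training, with the expert's weights kept fixed.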