4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

📅 2025-12-18
🤖 AI Summary
Current multimodal large language models (MLLMs) are limited in modeling 4D (3D + temporal) structure and in dynamic reasoning, and existing benchmarks and methods do not support region-level, fine-grained understanding. To address this, we propose the first MLLM framework for region-level 4D understanding: (1) We introduce R4D-Bench, the first region-level 4D video question answering benchmark; (2) We propose Perceptual 4D Distillation (P4D), which transfers 4D representations from a frozen expert model to lightweight MLLMs; (3) We design 4D-RGPT, a 4D-perception-specific architecture that integrates joint video and point cloud representations with region-level visual prompting. Our method achieves substantial improvements over state-of-the-art approaches across multiple 4D VQA benchmarks and attains leading performance on R4D-Bench. This work provides the first systematic validation of the effectiveness and scalability of region-level 4D perception and reasoning.

📝 Abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
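The abstract describes P4D only at a high level: representations from a frozen expert model are transferred into 4D-RGPT during training. The sketch below shows one generic way such a transfer is often set up, as a feature-matching loss added to the language-modeling loss. It is an illustration under stated assumptions, not the paper's actual objective: the linear projection, the mean-squared-error criterion, the weighting factor `lam`, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def project(features: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Map one student feature vector into the teacher's feature space.

    `weights` holds one row per teacher dimension (hypothetical linear head).
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def distillation_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between projected student features and teacher features.

    The teacher features are treated as fixed targets (the expert is frozen),
    so only the student and the projection would receive gradients in training.
    """
    total = 0.0
    count = 0
    for s, t in zip(student_feats, teacher_feats):
        p = project(s, weights)
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
        count += len(t)
    return total / count


def total_loss(lm_loss: float, distill_loss: float, lam: float = 1.0) -> float:
    """Combine the task loss with the distillation term (lam is an assumed weight)."""
    return lm_loss + lam * distill_loss
```

In a real training loop the same structure would be written with a tensor library so that gradients reach the student; the pure-Python version here only makes the arithmetic explicit.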
Problem

Research questions and friction points this paper is trying to address.

MLLMs have weak 4D perception and temporal understanding, which limits reasoning over 3D structures and temporal dynamics
Existing 3D and 4D VQA benchmarks lack region-level prompting
Existing benchmarks emphasize static scenes rather than depth-aware dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D-RGPT, a specialized MLLM, captures 4D representations from video inputs with enhanced temporal perception
Perceptual 4D Distillation (P4D) transfers 4D representations from a frozen expert model into 4D-RGPT
R4D-Bench introduces region-level prompting for depth-aware dynamic scenes
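To make "region-level prompting" concrete, here is a minimal sketch of what a region-prompted video QA sample could look like, with a small consistency check. The schema is an assumption for illustration only: the field names, the normalized bounding-box format, the `<region1>`-style placeholder tokens, and the multiple-choice layout are hypothetical and are not taken from R4D-Bench.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RegionPrompt:
    frame_index: int                         # frame on which the region is marked
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    label: str                               # placeholder token used in the question text


@dataclass
class RegionVQASample:
    video_id: str
    question: str
    regions: List[RegionPrompt]
    choices: List[str]
    answer_index: int


def validate(sample: RegionVQASample) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sample is consistent."""
    problems = []
    for region in sample.regions:
        x0, y0, x1, y1 = region.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            problems.append(f"{region.label}: box out of range")
        if region.label not in sample.question:
            problems.append(f"{region.label}: not referenced in question")
    if not 0 <= sample.answer_index < len(sample.choices):
        problems.append("answer_index out of range")
    return problems


sample = RegionVQASample(
    video_id="demo_clip",
    question="Is <region1> moving toward or away from the camera?",
    regions=[RegionPrompt(frame_index=0, box=(0.1, 0.2, 0.4, 0.6), label="<region1>")],
    choices=["toward", "away"],
    answer_index=0,
)
print(validate(sample))
```

Tying each region to a frame index and a placeholder token keeps the question text unambiguous about which object is being asked about, which is the point of prompting at the region level rather than over the whole video.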