ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of constructing immersive stereo video for VR/AR, this paper introduces the first volumetric video framework enabling mobile, synchronized multi-view audio-visual capture. Methodologically, we design a mobile multimodal acquisition system integrating 5K/60FPS high-speed imaging, spatial alignment, and precise temporal synchronization, and build an end-to-end multimodal volumetric reconstruction pipeline. We further propose the first reconstruction benchmark and evaluation protocol tailored for 6-DoF immersive VR. Our contributions are threefold: (1) the release of ImViD, a novel immersive volumetric video dataset featuring multiple dynamic scenes with 1–5 minute synchronized audio-visual sequences; (2) the establishment of the first 6-DoF multimodal VR reconstruction benchmark; and (3) empirical validation of baseline methods under high-fidelity rendering, large interactive volumes, and multimodal feedback, thereby establishing a new paradigm for immersive content generation.

📝 Abstract
User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60FPS, last 1–5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audio-visual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark, reconstruction, and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.
Problem

Research questions and friction points this paper is trying to address.

Enhance VR engagement with immersive volumetric videos.
Develop a dataset for multi-view, multi-modal VR content.
Create a pipeline for 6-DoF immersive VR experiences.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view, multi-modal dataset for volumetric videos
5K resolution at 60FPS with synchronized audio
Base pipeline for 6-DoF immersive VR experiences
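The dataset's mobile rig captures multiple 60FPS camera streams with precise temporal synchronization, so any downstream reconstruction step must first pair frames across cameras at a common reference time. The sketch below is purely illustrative (not the authors' pipeline code); the function names, the per-camera timestamp lists, and the half-frame-period tolerance are all assumptions made for the example.

```python
from typing import List, Optional

FPS = 60
HALF_PERIOD = 1.0 / (2 * FPS)  # max tolerated offset (~8.3 ms) to call frames "synchronized"

def nearest_frame_index(timestamps: List[float], t: float) -> int:
    """Index of the frame whose timestamp is closest to reference time t."""
    return min(range(len(timestamps)), key=lambda i: abs(timestamps[i] - t))

def synchronized_indices(streams: List[List[float]], t: float) -> Optional[List[int]]:
    """For each camera's timestamp list, pick the frame nearest to t.

    Returns one index per camera, or None if any camera's best frame is
    off by more than half a frame period (e.g. due to a dropped frame),
    meaning the streams cannot be treated as synchronized at time t.
    """
    picks = []
    for ts in streams:
        i = nearest_frame_index(ts, t)
        if abs(ts[i] - t) > HALF_PERIOD:
            return None
        picks.append(i)
    return picks
```

With two well-aligned 60FPS streams (a few milliseconds of jitter), `synchronized_indices` returns matching frame indices; if one stream has dropped frames near the query time, it returns None, flagging that the multi-view bundle at that instant is unusable for reconstruction.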
Zhengxian Yang
Tsinghua University
Shi Pan
Department of Building Science, Tsinghua University
Shengqi Wang
Tsinghua University
Haoxiang Wang
Tsinghua University
Li Lin
Migu Beijing Research Institute
Guanjun Li
Institute of Automation, Chinese Academy of Sciences
Audio Processing, Audio-visual Learning
Zhengqi Wen
Tsinghua University
LLM
Borong Lin
Department of Building Science, Tsinghua University
Jianhua Tao
BNRist & Department of Automation, Tsinghua University
Tao Yu
BNRist & Department of Automation, Tsinghua University