Abstract
Multimodal research and applications are becoming more commonplace as Virtual Reality (VR) technology integrates different kinds of sensory feedback, enabling the recreation of real spaces in an audio-visual context. Within VR experiences, numerous applications rely on the user's voice as a key element of interaction, including music performance and public speaking. Self-perception of one's own voice plays a crucial role in vocal production: when singing or speaking, the voice interacts with the acoustic properties of the environment, and we adjust vocal parameters in response to the perceived characteristics of the space. This technical report presents a real-time auralization pipeline that leverages three-dimensional Spatial Impulse Responses (SIRs) for multimodal research applications in VR requiring first-person vocal interaction. It describes the impulse response creation and rendering workflow and the audio-visual integration, and addresses latency and computational considerations. The system enables users to explore acoustic spaces from various positions and orientations within a predefined area, supporting three and five Degrees of Freedom (3DoF and 5DoF) in audio-visual multimodal perception for both research and creative applications in VR.
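The core operation behind such an auralization pipeline is convolving the user's dry voice signal with a measured multichannel SIR. The sketch below is a minimal offline illustration using NumPy, not the report's implementation: a real-time system would instead use low-latency partitioned convolution, and the four-channel SIR standing in for a first-order Ambisonics response, the sample rate, and the decay constant are all illustrative assumptions.

```python
import numpy as np

def auralize(dry: np.ndarray, sir: np.ndarray) -> np.ndarray:
    """Convolve a mono dry signal with a multichannel SIR via FFT.

    dry: shape (n,), the close-miked voice signal
    sir: shape (channels, m), e.g. 4 channels for a first-order Ambisonics SIR
    Returns the wet signal, shape (channels, n + m - 1).
    """
    n_out = dry.shape[0] + sir.shape[1] - 1
    n_fft = 1 << (n_out - 1).bit_length()   # next power of two >= output length
    spectrum = np.fft.rfft(sir, n_fft, axis=1) * np.fft.rfft(dry, n_fft)
    return np.fft.irfft(spectrum, n_fft, axis=1)[:, :n_out]

# Toy example: 100 ms of noise through a synthetic 4-channel decaying SIR at 48 kHz
fs = 48_000
rng = np.random.default_rng(0)
dry = rng.standard_normal(fs // 10)
t = np.arange(fs // 2) / fs
sir = rng.standard_normal((4, fs // 2)) * np.exp(-6.9 * t)  # decay rate ~RT60 of 1 s
wet = auralize(dry, sir)
print(wet.shape)  # (4, 28799)
```

For real-time self-hearing feedback, the same frequency-domain product would be evaluated block by block (overlap-add over partitions of the SIR), so that each audio buffer of the live voice is rendered within a few milliseconds rather than after the whole signal is available.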