A Synchronized Audio-Visual Multi-View Capture System

📅 2026-03-24
🤖 AI Summary
Existing systems struggle to support strict audio-visual synchronization, limiting the analysis of fine-grained temporal features in dialogue such as turn-taking, overlapping speech, and prosody. To address this challenge, this work proposes an end-to-end multimodal acquisition and calibration framework that treats synchronized audio and video as equally central modalities. By integrating a multi-camera array with multi-channel microphones under a unified timing architecture, the system enables scalable, reproducible, high-quality recording. Standardized calibration and quality control procedures ensure high temporal consistency across modalities, yielding data that supports fine-grained analysis of conversational behavior and data-driven modeling.

📝 Abstract
Multi-view capture systems have been an important research tool for recording human motion under controlled conditions. Most existing systems are specified around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, even though both are essential for studying conversational interaction, where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment and show that the resulting recordings are temporally consistent enough to support fine-grained analysis and data-driven modeling of conversational behavior.
Problem

Research questions and friction points this paper is trying to address.

multi-view capture
audio-visual synchronization
conversational interaction
temporal alignment
human motion recording
Innovation

Methods, ideas, or system contributions that make the work stand out.

synchronized audio-visual capture
multi-view system
temporal alignment
conversational interaction
unified timing architecture
Xiangwei Shi
PhD student, Delft University of Technology
computer vision
Era Dorta Perez
Delft University of Technology, Netherlands
Ruud de Jong
Delft University of Technology, Netherlands
Ojas Shirekar
Delft University of Technology, Netherlands
Chirag Raman
Delft University of Technology
Multimodal Machine Learning, Computer Vision, Human-Computer Interaction