🤖 AI Summary
Problem: The education domain lacks large-scale, publicly available classroom speech data. Existing datasets are small and not publicly available, and there are no dedicated corpora of classroom noise or room impulse responses (RIRs), which hinders robust speech model training and effective data augmentation. Method: We propose the first game-engine-based classroom acoustic scene simulation framework, leveraging Unity’s real-time rendering to synthesize high-fidelity, configurable RIRs and background noise. Integrating children’s speech, instructional video audio, and public corpora, we construct RealClass, a classroom speech dataset with synthesized noise and RIRs that supports controllable generation of diverse noisy samples. Contribution/Results: Experiments demonstrate that RealClass closely matches real classroom speech in both acoustic characteristics and ASR performance. When used for pretraining or fine-tuning, it significantly improves generalization of educational speech models under both clean and noisy conditions.
📝 Abstract
The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Existing classroom datasets remain small and are not publicly available, and the absence of dedicated classroom noise or Room Impulse Response (RIR) corpora prevents the use of standard data augmentation techniques.
In this paper, we introduce a scalable methodology for synthesizing classroom noise and RIRs using game engines, a versatile approach that extends to domains beyond the classroom. Building on this methodology, we present RealClass, a dataset that combines a synthesized classroom noise corpus with a classroom speech dataset compiled from publicly available corpora. The speech data pairs a children's speech corpus with instructional speech extracted from YouTube videos to approximate real classroom interactions in clean conditions. Experiments on clean and noisy speech show that RealClass closely approximates real classroom speech, making it a valuable resource when real classroom speech is scarce.
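
The standard data augmentation that noise and RIR corpora enable can be illustrated with a minimal NumPy sketch: clean speech is convolved with an RIR, then noise is mixed in at a chosen signal-to-noise ratio. This is a generic illustration, not the authors' pipeline; the function names (`reverberate`, `scale_noise_to_snr`, `augment`) and the `snr_db` parameter are hypothetical, and all inputs are assumed to be mono float arrays at the same sample rate.

```python
import numpy as np


def reverberate(speech, rir):
    """Convolve dry speech with a room impulse response, keeping the input length."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]


def scale_noise_to_snr(signal, noise, snr_db):
    """Tile or crop noise to the signal length and scale it to the target SNR (dB)."""
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[: len(signal)]
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise is silent; cannot scale to a target SNR")
    # SNR_dB = 10 * log10(signal_power / (gain^2 * noise_power)), solved for gain
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return gain * noise


def augment(speech, rir, noise, snr_db):
    """Illustrative augmentation: reverberate the speech, then add noise at snr_db."""
    wet = reverberate(speech, rir)
    return wet + scale_noise_to_snr(wet, noise, snr_db)
```

In this sketch the SNR is measured against the reverberated speech rather than the dry signal; measuring against the dry signal is an equally plausible convention.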