🤖 AI Summary
This study addresses a limitation in sound effects research: existing datasets use incompatible labeling schemes and metadata structures, which hinders data integration and cross-study comparison in both classification and generation tasks. To overcome this, the authors propose a modular, open-source relabeling framework built on the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects. The framework uses a rule-based, multi-stage pipeline with conflict resolution to convert the heterogeneous tags of existing datasets to UCS, suggests a stratified dataset split for the new labels, and supports combining multiple datasets. To demonstrate its practical utility, the authors construct and publicly release EnvSound-UCS, a unified UCS-compliant dataset of 58,057 environmental sound clips drawn from AudioSet, FSD50K, and ESC-50. According to the abstract, the pipeline achieves high automatic conversion rates.
📝 Abstract
Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation: incompatible taxonomies lead to siloed datasets that may require individualized approaches, yield non-comparable results, and prevent merging data across sources. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert the tags of existing datasets to UCS using a rule-based multi-stage pipeline with conflict resolution, achieving high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase its practical utility, we introduce the EnvSound-UCS dataset, a publicly available, unified, UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.
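The abstract does not spell out the individual pipeline stages or the conflict-resolution rules, so the following is only a minimal sketch of what a rule-based, multi-stage tag conversion with conflict resolution could look like. Everything in it is an assumption made for illustration: the rule tables `EXACT_RULES` and `KEYWORD_RULES`, the function names `map_tag`, `relabel_clip`, and `conversion_rate`, the two-stage ordering, and the tie-handling policy are hypothetical, and the category/subcategory pairs are illustrative rather than checked against the official UCS list.

```python
from collections import Counter

# Hypothetical rule tables (assumptions for illustration only).
# The (category, subcategory) pairs are NOT verified against the official UCS list.
EXACT_RULES = {
    "dog": ("ANIMALS", "DOG"),
    "rain": ("RAIN", "GENERAL"),
}
KEYWORD_RULES = [
    ("bark", ("ANIMALS", "DOG")),
    ("thunder", ("WEATHER", "THUNDER")),
]


def map_tag(tag):
    """Map one source tag to (label, stage), or (None, None) if no rule fires.

    Stage 1: exact match on the normalized tag.
    Stage 2: substring keyword match, tried only if stage 1 fails.
    """
    key = tag.strip().lower()
    if key in EXACT_RULES:
        return EXACT_RULES[key], 1
    for keyword, label in KEYWORD_RULES:
        if keyword in key:
            return label, 2
    return None, None


def relabel_clip(tags):
    """Resolve all tags of one clip to a single label, or None if unresolved.

    Assumed conflict policy: trust the earliest (most specific) stage that
    produced any candidate, take the majority label within that stage, and
    leave ties unresolved so they can be reviewed manually.
    """
    candidates = [map_tag(t) for t in tags]
    candidates = [(label, stage) for label, stage in candidates if label is not None]
    if not candidates:
        return None
    best_stage = min(stage for _, stage in candidates)
    votes = Counter(label for label, stage in candidates if stage == best_stage)
    (top_label, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie between labels: no automatic decision
    return top_label


def conversion_rate(clips):
    """Fraction of clips that received a label automatically."""
    if not clips:
        return 0.0
    converted = sum(1 for tags in clips if relabel_clip(tags) is not None)
    return converted / len(clips)
```

In this sketch a clip tagged `["Dog", "Bark"]` resolves to the dog label, because the exact-match stage takes precedence over the keyword stage, while a clip tagged `["Dog", "Rain"]` stays unresolved because two exact matches disagree. Returning `None` for ties keeps the automatic conversion rate honest: only clips with an unambiguous mapping count as converted.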