Contrastive Learning from Synthetic Audio Doppelgängers

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Audio representation learning currently relies on large-scale real-world recordings, yet manual annotation and data augmentation struggle to capture the full diversity of physical acoustics. Method: a synthetic-data-driven contrastive learning framework that requires no real audio. Stochastic, differentiable sound synthesizers generate physically consistent synthetic "twin" positive pairs on the fly by applying causally interpretable parameter perturbations (e.g., to timbre, pitch, and envelope), yielding high-diversity contrastive tasks. Contribution/Results: the paper introduces a positive-pair construction paradigm grounded in causal perturbation of synthesizer parameters; the method needs only a single interpretable hyperparameter and no real-data storage; and models trained purely on synthetic data outperform real-data baselines on ESC-50, UrbanSound8K, and SpeechCommands. This substantially reduces data dependency and storage overhead, pointing toward a new paradigm for low-resource audio representation learning.

📝 Abstract
Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers: synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through augmentations of existing audio, provide a rich source of contrastive information. Despite the shift to randomly generated synthetic data, our method produces strong representations, outperforming real data on several standard audio classification tasks. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners.
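The core idea of the abstract, sampling random synthesizer parameters and perturbing them to produce a positive pair, can be sketched in a few lines. The toy additive synthesizer below and its three-parameter vector are illustrative assumptions; the paper's actual synthesizer and parameterization differ, but the pair-construction logic, where a single perturbation scale `sigma` plays the role of the method's one hyperparameter, is the same in spirit.

```python
import numpy as np

SR = 16000  # sample rate (illustrative choice)

def synth(theta, dur=1.0, sr=SR):
    """Render a toy sound from a parameter vector.

    theta = (f0, brightness, decay): fundamental frequency in Hz,
    harmonic brightness rolloff, and envelope decay rate. This toy
    additive synthesizer stands in for the paper's synthesizer.
    """
    f0, brightness, decay = theta
    t = np.arange(int(dur * sr)) / sr
    # Additive synthesis: 8 harmonics weighted by a brightness rolloff.
    audio = sum(
        brightness ** (k - 1) * np.sin(2 * np.pi * k * f0 * t)
        for k in range(1, 9)
    )
    # Exponentially decaying temporal envelope.
    audio *= np.exp(-decay * t)
    return audio / (np.max(np.abs(audio)) + 1e-8)

def doppelganger_pair(rng, sigma=0.25):
    """Sample random synth parameters, then perturb them to obtain
    a doppelganger positive pair. `sigma` is the perturbation scale,
    the analogue of the method's single hyperparameter."""
    theta = np.array([rng.uniform(80, 800),    # f0 (Hz)
                      rng.uniform(0.3, 0.9),   # brightness
                      rng.uniform(1.0, 8.0)])  # decay
    # Multiplicative log-normal perturbation keeps parameters positive.
    theta_twin = theta * np.exp(sigma * rng.standard_normal(3))
    return synth(theta), synth(theta_twin)

rng = np.random.default_rng(0)
x1, x2 = doppelganger_pair(rng, sigma=0.25)
```

Because the pair shares one underlying parameter vector, the two clips differ in causally interpretable ways (pitch, timbre, envelope) rather than through post-hoc signal transformations of a fixed recording.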
Problem

Research questions and friction points this paper is trying to address.

Learning robust audio representations requires large real-world datasets.
Existing audio transformations lack true real-world sound diversity.
Can synthetic audio doppelgängers supply rich contrastive information without any real data?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic audio doppelgängers for contrastive learning
Randomly perturbed sound synthesizer parameters
Lightweight method with single hyperparameter
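Once pairs are generated, they feed a standard contrastive objective. As a minimal sketch, the NT-Xent (InfoNCE) loss below operates on a batch of embedding pairs, where each pair's twin is the positive and all other batch items are negatives; the encoder producing the embeddings is assumed, not specified by this summary.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent / InfoNCE loss over a batch of embedding pairs.

    z1[i] and z2[i] are embeddings of one doppelganger pair; every
    other embedding in the batch serves as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # mask self-similarity
    n = len(z1)
    # The positive of row i is row i + n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-sum-exp over each row's candidates.
    row_max = np.max(sim, axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.sum(np.exp(sim - row_max), axis=1))
    return np.mean(lse - sim[np.arange(2 * n), pos])

rng = np.random.default_rng(1)
z1 = rng.standard_normal((4, 8))
# Aligned pairs (twin = slightly perturbed copy) should score lower
# loss than randomly matched pairs.
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.standard_normal((4, 8)))
loss_random = nt_xent(z1, rng.standard_normal((4, 8)))
```

The single perturbation scale controls how hard the positives are: small perturbations make twins nearly identical (easy positives), while large ones push them toward unrelated sounds.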