VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how to evaluate whether general-purpose audio representations resolve the content identity of single-source audio in a zero-shot setting. It introduces VocSim, a training-free benchmark of 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds, which probes the intrinsic geometry of frozen embeddings. Embeddings are scored with Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation, with GSR calibrated as lift over an empirical permutation baseline. Across the foundation models tested, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. The benchmark also reveals a consistent generalization gap: on blind, low-resource speech, local retrieval drops sharply. As external validation, the top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. Data, code, and a public leaderboard are released.

📝 Abstract

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap: on blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
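
The calibration idea in the abstract, reporting lift over an empirical permutation baseline, can be illustrated with a minimal sketch. The abstract does not define GSR or the exact form of the lift, so the code makes two assumptions: it uses a stand-in score (`nn_agreement`, nearest-neighbour label agreement, which is not GSR), and it takes lift to be the observed score minus the mean score over random label shuffles. All function names here are hypothetical.

```python
import random


def nn_agreement(points, labels):
    """Stand-in score (NOT the paper's GSR): fraction of points whose
    nearest neighbour, by squared Euclidean distance, shares their label."""
    n = len(points)
    hits = 0
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / n


def permutation_lift(points, labels, score_fn, n_perm=200, seed=0):
    """Return (observed, baseline, lift).

    baseline is the mean score over n_perm label shuffles; lift is assumed
    to be observed - baseline (a ratio would be another reasonable choice).
    """
    rng = random.Random(seed)
    observed = score_fn(points, labels)
    shuffled = list(labels)  # copy so the caller's labels are not mutated
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += score_fn(points, shuffled)
    baseline = total / n_perm
    return observed, baseline, observed - baseline


# Toy data: two tight, well-separated clusters of two points each.
points = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
labels = ["a", "a", "b", "b"]
observed, baseline, lift = permutation_lift(points, labels, nn_agreement)
```

Shuffling labels while keeping the points fixed preserves the class sizes and the geometry, so the baseline reflects what the score would be if labels carried no information about position. A score that stays above this baseline is distinguishable from chance even when its absolute value is low, which is the distinction the abstract draws for low-resource speech.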

Problem

Research questions and friction points this paper is trying to address.

Evaluates zero-shot content identity in single-source audio embeddings.

Measures geometric alignment of frozen embeddings without training.

Exposes a generalization gap on blind, low-resource speech with unseen phonotactics.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free benchmark evaluating frozen audio embeddings

Single-source audio isolating content from separation confounds

Geometric metrics assessing local purity and global separation
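
The local-purity side of these metrics, Precision@k, can be sketched in a few lines. For each clip, the k nearest other clips are retrieved and the fraction sharing the query's label is averaged over all queries. The choice of cosine distance and the function names are assumptions made for illustration; the source does not state which distance the benchmark uses.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def precision_at_k(embeddings, labels, k):
    """Mean fraction of each point's k nearest neighbours with the same label.

    The query itself is excluded from its own neighbour list, so k must be
    at most len(embeddings) - 1.
    """
    n = len(embeddings)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    total = 0.0
    for i in range(n):
        # Rank every other point by distance to the query.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: cosine_distance(embeddings[i], embeddings[j]),
        )
        top_k = ranked[:k]
        total += sum(labels[j] == labels[i] for j in top_k) / k
    return total / n


# Toy embeddings: two classes pointing along different axes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
p_at_1 = precision_at_k(embeddings, labels, k=1)
p_at_2 = precision_at_k(embeddings, labels, k=2)
```

In this toy case every point has exactly one same-class neighbour, so purity is perfect at k=1 and halves at k=2. No parameters are fitted at any point, which is what makes the evaluation training-free: only the geometry of the frozen embeddings is measured.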
