SpeakerCard-1M: An Evidence-Grounded Speaker Card Corpus for In-the-Wild Speaker Verification

📅 2026-06-02

🤖 AI Summary

This work addresses the limited interpretability of current speaker verification systems and the lack of in-the-wild, speaker-level supervised data that supports natural language queries. The authors propose an evidence-grounded approach to building speaker profiles and release a bilingual corpus of 56.7K Speaker Card records over 10.2K speakers, together with 1.78M utterance-level captions. Ten acoustic probes extract field-level evidence, which is aggregated into structured speaker profiles and then rendered as interpretable Speaker Cards by a constrained large language model. The study also defines two cross-modal evaluation protocols, bidirectional speaker-text retrieval and attribute-conditioned verification, to support benchmarking of interpretable speaker verification. In the experiments, joint audio-text training of a dual-encoder model improves VoxCeleb1-O EER by 0.31% absolute over the audio-only baseline, and the dual encoder reaches 88.66% accuracy on pitch-level attribute-conditioned verification, compared with 49-77% for eight recent audio language models.
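For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false-acceptance rate equals the false-rejection rate, so lower is better. The following is a minimal sketch of how it can be computed from trial scores. The function name and the toy scores are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER as a fraction, given same-speaker and different-speaker scores."""
    best_gap, eer = float("inf"), 1.0
    # Every observed score is a candidate threshold; a trial is accepted when score >= threshold.
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)  # false accepts
        frr = sum(s < thr for s in target_scores) / len(target_scores)         # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.5, 0.3, 0.2]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.2f}%")
```

An "absolute" change in EER is a difference in percentage points between two such values, not a relative percentage of the baseline.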

📝 Abstract

Modern speaker verification (SV) systems rely on speaker embeddings that are effective but difficult to interpret or query in natural language. Most existing speech-text corpora target controllable synthesis or utterance-level captioning, and provide limited speaker-level supervision for in-the-wild speaker recognition. This paper introduces SpeakerCard-1M, a bilingual speaker-centric resource for evidence-grounded SV, derived from VoxCeleb1/2 and CN-Celeb1/2, where the "-1M" suffix refers to the 1.78M utterance-level captions contained in the release. We adopt a tool-first, LLM-last approach: ten acoustic probes produce field-level evidence, the evidence is aggregated into speaker profiles under a schema that separates relatively stable traits from utterance-level states, and bilingual Speaker Cards are rendered by a constrained LLM that sees only the structured fields. The release includes 56.7K Speaker Card records over 10.2K speakers, 1.78M utterance-level captions, and speaker-ID-disjoint hard-negative triplets. We further define two SV-oriented cross-modal protocols, bidirectional Speaker-Text Retrieval (T2S-R / S2T-R) and Attribute-Conditioned Verification (AC-Verify), and compare a dual-encoder baseline against recent audio language models under a zero-shot forced-choice setting. Joint audio-text training reduces VoxCeleb1-O EER by 0.31% absolute relative to the audio-only baseline. Under a style-symmetric LLM-generated counterfactual protocol, eight recent audio language models (7B-30B+ parameters, both open- and closed-source) score 49-77% on pitch-level AC-Verify under two-way forced choice, compared with 88.66% reached by our dual encoder.
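The two-way forced-choice setting can be illustrated with a short sketch. It assumes that a dual encoder scores each trial by cosine similarity between an audio embedding and two candidate text embeddings (the true description and a counterfactual) and picks the higher one. The abstract does not specify the scoring rule, so the cosine choice, the function names, and the toy embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def forced_choice_accuracy(trials):
    """trials: list of (audio_emb, true_text_emb, counterfactual_text_emb)."""
    # A trial counts as correct when the true description scores strictly higher.
    correct = sum(cosine(a, t) > cosine(a, c) for a, t, c in trials)
    return correct / len(trials)


# Toy embeddings, invented for illustration only.
trials = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),  # true description is closer
    ([0.0, 1.0], [0.2, 0.8], [0.8, 0.2]),  # true description is closer
    ([1.0, 1.0], [1.0, 0.0], [1.0, 1.0]),  # counterfactual is closer (an error)
    ([0.5, 0.5], [0.6, 0.4], [1.0, 0.0]),  # true description is closer
]
accuracy = forced_choice_accuracy(trials)
print(f"accuracy = {100 * accuracy:.1f}%")
```

With two candidates per trial, random guessing yields 50% in expectation, which is the reference point for reading the reported 49-77% range.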

Problem

Research questions and friction points this paper is trying to address.

speaker verification

in-the-wild

speaker-level supervision

speech-text corpus

interpretable embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-grounded speaker verification

structured speaker profiling

tool-first LLM-last pipeline
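
The tool-first, LLM-last idea described in the abstract can be sketched as a small data flow: probe outputs are collected per utterance, aggregated into a speaker profile that keeps stable traits apart from utterance-level states, and only then turned into text. Everything below is hypothetical: the field names, the pitch bin edges, and the template renderer, which merely stands in for the constrained LLM.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class UtteranceEvidence:
    speaker_id: str
    mean_f0_hz: float         # hypothetical pitch-probe output
    speaking_rate_sps: float  # hypothetical rate-probe output (syllables/s)


@dataclass
class SpeakerProfile:
    speaker_id: str
    stable_traits: dict       # aggregated over all of the speaker's utterances
    utterance_states: list    # kept per utterance, not merged into traits


def pitch_level(f0_hz):
    # Illustrative bin edges, not taken from the paper.
    if f0_hz < 130:
        return "low"
    return "medium" if f0_hz < 200 else "high"


def aggregate(evidence):
    """Aggregate one speaker's utterance-level evidence into a profile."""
    f0 = median(e.mean_f0_hz for e in evidence)
    return SpeakerProfile(
        speaker_id=evidence[0].speaker_id,
        stable_traits={"pitch_level": pitch_level(f0)},
        utterance_states=[{"speaking_rate_sps": e.speaking_rate_sps} for e in evidence],
    )


def render_card(profile):
    # Template stand-in for the constrained LLM: it reads only structured fields.
    return f"Speaker {profile.speaker_id}: {profile.stable_traits['pitch_level']}-pitched voice."


evidence = [
    UtteranceEvidence("spk001", 118.0, 4.2),
    UtteranceEvidence("spk001", 124.0, 5.1),
    UtteranceEvidence("spk001", 135.0, 3.8),
]
profile = aggregate(evidence)
card = render_card(profile)
print(card)
```

Because the renderer never sees raw audio or free text, every statement on the card can be traced back to a structured field, which is the property the evidence-grounded design is after.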