🤖 AI Summary
Current audio large language models (ALLMs) show weak generalization in zero-shot speaker verification (SV), degrading sharply under varied acoustic conditions. This work reformulates SV as an audio question-answering task, introduces a rule-based hard-pair sampling strategy for constructing challenging training pairs, and applies lightweight supervised fine-tuning. Key contributions include: (i) the first demonstration of ALLMs jointly verifying speaker identity and spoken content; and (ii) unified modeling of both text-dependent and text-independent SV within a single framework. Experiments show that fine-tuning substantially improves zero-shot SV performance, though a gap to conventional systems remains; in text-dependent settings, accuracy is competitive with cascaded ASR-SV systems. These results suggest ALLMs have substantial potential as unified, multi-task foundation models for robust speaker verification.
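The reformulation of SV as audio question answering can be pictured as building a QA-style prompt over two audio clips. The message schema and field names below are illustrative assumptions for a generic chat-style ALLM interface, not the paper's actual API:

```python
# Hypothetical sketch: casting speaker verification as an audio
# question-answering prompt. The dict schema here is an assumption,
# not the interface used in the paper.

def build_sv_prompt(audio_a, audio_b, phrase=None):
    """Build a QA-style prompt asking an ALLM to compare two utterances.

    If `phrase` is given, the question also checks spoken content
    (text-dependent SV); otherwise it is a pure identity check.
    """
    question = "Are these two utterances spoken by the same speaker?"
    if phrase is not None:
        question += f' Also, does the second utterance say: "{phrase}"?'
    return {
        "audios": [audio_a, audio_b],  # paths or handles to the two clips
        "question": question,
        "answer_format": "yes/no",     # constrain the model's response
    }

prompt = build_sv_prompt("enroll.wav", "test.wav", phrase="open the door")
print(prompt["question"])
```

The same prompt template serves both the text-independent and text-dependent settings, which is what allows a single model to cover both tasks.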
📝 Abstract
This paper investigates adapting Audio Large Language Models (ALLMs) for speaker verification (SV). We reformulate SV as an audio question-answering task and conduct comprehensive zero-shot evaluations on public benchmarks, showing that current ALLMs have limited zero-shot SV capability and often struggle under diverse acoustic conditions. To address this, we perform supervised fine-tuning on speaker verification data, proposing a rule-based hard-pair sampling strategy to construct more challenging training pairs. Lightweight fine-tuning substantially improves performance, though a gap remains between ALLMs and conventional models. We then extend the framework to text-dependent SV by jointly querying ALLMs to verify speaker identity and spoken content, yielding results competitive with cascaded ASR-SV systems. Our findings demonstrate that, with proper adaptation, ALLMs hold substantial potential as unified models for robust speaker verification while maintaining their general audio understanding capabilities.
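The abstract does not spell out the sampling rules, so the following is only a plausible sketch of what a rule-based hard-pair strategy might look like: positives are drawn across recording sessions of the same speaker, and negatives are restricted to same-gender speaker pairs so they are harder to reject. Both rules are assumptions for illustration, not the paper's exact recipe:

```python
import itertools
import random

# Hedged sketch of rule-based hard-pair sampling for SV fine-tuning.
# Rules (assumed, not from the paper):
#   positives: same speaker, different recording session
#   negatives: different speakers of the same gender (harder to tell apart)

def sample_hard_pairs(utts, n_pairs, seed=0):
    """utts: list of dicts with keys 'spk', 'gender', 'session', 'path'.

    Returns a balanced list of (path_a, path_b, label) triples,
    label 1 = same speaker, 0 = different speakers.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for a, b in itertools.combinations(utts, 2):
        if a["spk"] == b["spk"] and a["session"] != b["session"]:
            positives.append((a["path"], b["path"], 1))  # cross-session positive
        elif a["spk"] != b["spk"] and a["gender"] == b["gender"]:
            negatives.append((a["path"], b["path"], 0))  # same-gender hard negative
    rng.shuffle(positives)
    rng.shuffle(negatives)
    k = n_pairs // 2
    return positives[:k] + negatives[:k]

# Toy usage on a tiny synthetic utterance list
utts = [
    {"spk": "s1", "gender": "f", "session": "a", "path": "s1a.wav"},
    {"spk": "s1", "gender": "f", "session": "b", "path": "s1b.wav"},
    {"spk": "s2", "gender": "f", "session": "a", "path": "s2a.wav"},
    {"spk": "s3", "gender": "m", "session": "a", "path": "s3a.wav"},
]
pairs = sample_hard_pairs(utts, n_pairs=2)
```

Filtering negatives by shared metadata (here, gender) is one simple way to bias training toward confusable pairs without needing a pretrained speaker embedding to mine them.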