🤖 AI Summary
Current audio large language models (ALLMs) show weak generalization in zero-shot speaker verification (SV), degrading sharply under varied acoustic conditions. This work reformulates SV as an audio question-answering task, introduces a rule-based hard-pair sampling strategy for constructing challenging training pairs, and applies lightweight supervised fine-tuning. Key contributions include: (i) the first demonstration of ALLMs jointly verifying speaker identity and spoken content; and (ii) unified modeling of both text-dependent and text-independent SV within a single framework. Experiments show that fine-tuning substantially improves zero-shot SV performance, though a gap to conventional systems remains; in text-dependent settings, accuracy is competitive with cascaded ASR-SV systems. These results suggest ALLMs have substantial potential as unified, multi-task foundation models for robust speaker verification.
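The reformulation of SV as audio question answering can be pictured as building a QA-style prompt over two audio clips. The message schema and field names below are illustrative assumptions for a generic chat-style ALLM interface, not the paper's actual API:

```python
# Hypothetical sketch: casting speaker verification as an audio
# question-answering prompt. The dict schema here is an assumption,
# not the interface used in the paper.

def build_sv_prompt(audio_a, audio_b, phrase=None):
    """Build a QA-style prompt asking an ALLM to compare two utterances.

    If `phrase` is given, the question also checks spoken content
    (text-dependent SV); otherwise it is a pure identity check.
    """
    question = "Are these two utterances spoken by the same speaker?"
    if phrase is not None:
        question += f' Also, does the second utterance say: "{phrase}"?'
    return {
        "audios": [audio_a, audio_b],  # paths or handles to the two clips
        "question": question,
        "answer_format": "yes/no",     # constrain the model's response
    }

prompt = build_sv_prompt("enroll.wav", "test.wav", phrase="open the door")
print(prompt["question"])
```

The same prompt template serves both the text-independent and text-dependent settings, which is what allows a single model to cover both tasks.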
📝 Abstract
This paper investigates adapting Audio Large Language Models (ALLMs) for speaker verification (SV). We reformulate SV as an audio question-answering task and conduct comprehensive zero-shot evaluations on public benchmarks, showing that current ALLMs have limited zero-shot SV capability and often struggle under diverse acoustic conditions. To address this, we perform supervised fine-tuning on speaker verification data, proposing a rule-based hard-pair sampling strategy to construct more challenging training pairs. Lightweight fine-tuning substantially improves performance, though a gap remains between ALLMs and conventional models. We then extend the framework to text-dependent SV by jointly querying ALLMs to verify speaker identity and spoken content, yielding results competitive with cascaded ASR-SV systems. Our findings demonstrate that, with proper adaptation, ALLMs hold substantial potential as unified models for robust speaker verification while maintaining their general audio understanding capabilities.
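The abstract does not spell out the sampling rules, so the following is only a plausible sketch of what a rule-based hard-pair strategy might look like: positives are drawn across recording sessions of the same speaker, and negatives are restricted to same-gender speaker pairs so they are harder to reject. Both rules are assumptions for illustration, not the paper's exact recipe:

```python
import itertools
import random

# Hedged sketch of rule-based hard-pair sampling for SV fine-tuning.
# Rules (assumed, not from the paper):
#   positives: same speaker, different recording session
#   negatives: different speakers of the same gender (harder to tell apart)

def sample_hard_pairs(utts, n_pairs, seed=0):
    """utts: list of dicts with keys 'spk', 'gender', 'session', 'path'.

    Returns a balanced list of (path_a, path_b, label) triples,
    label 1 = same speaker, 0 = different speakers.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for a, b in itertools.combinations(utts, 2):
        if a["spk"] == b["spk"] and a["session"] != b["session"]:
            positives.append((a["path"], b["path"], 1))  # cross-session positive
        elif a["spk"] != b["spk"] and a["gender"] == b["gender"]:
            negatives.append((a["path"], b["path"], 0))  # same-gender hard negative
    rng.shuffle(positives)
    rng.shuffle(negatives)
    k = n_pairs // 2
    return positives[:k] + negatives[:k]

# Toy usage on a tiny synthetic utterance list
utts = [
    {"spk": "s1", "gender": "f", "session": "a", "path": "s1a.wav"},
    {"spk": "s1", "gender": "f", "session": "b", "path": "s1b.wav"},
    {"spk": "s2", "gender": "f", "session": "a", "path": "s2a.wav"},
    {"spk": "s3", "gender": "m", "session": "a", "path": "s3a.wav"},
]
pairs = sample_hard_pairs(utts, n_pairs=2)
```

Filtering negatives by shared metadata (here, gender) is one simple way to bias training toward confusable pairs without needing a pretrained speaker embedding to mine them.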