Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of training end-to-end LLM-based systems for multi-talker speech recognition when only limited real-recorded data is available, a setting in which speaker diarization accuracy is often insufficient. The authors propose a dual-encoder architecture that extracts semantic and speaker features separately and merges them in a feature-interleaving format before passing them to the LLM. Training adds a length-aware speaker ID loss to strengthen diarization and an adaptive threshold on the ASR loss to mitigate hallucinations caused by overlapping speech, which balances the two tasks. The system outperforms open-source baselines, with relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
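To make the "relative improvement" figures concrete, here is a minimal sketch of how such a number is usually computed. It assumes the underlying metric is an error rate where lower is better; the summary does not name the metric, and the values in the example are hypothetical, not results from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Fraction of the baseline error removed by the new system.

    Assumes an error-rate style metric (lower is better).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical error rates chosen only to illustrate the arithmetic.
    hypothetical_baseline = 25.0
    hypothetical_system = 20.5
    gain = relative_improvement(hypothetical_baseline, hypothetical_system)
    print(f"relative improvement: {gain:.0%}")
```

A relative improvement of 18% therefore means the error dropped by 18% of the baseline's value, not by 18 absolute percentage points.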

📝 Abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
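Strategy (2), the feature interleaving format, can be pictured with a small sketch. The abstract does not specify the layout, so this is only one plausible reading: it assumes both encoders emit the same number of frames, that both streams are already projected to the LLM embedding dimension, and that frames alternate one by one. The function name `interleave_features` is hypothetical, and the paper's actual format may differ (for example, it could interleave in chunks).

```python
from typing import List, Sequence

Frame = List[float]


def interleave_features(semantic: Sequence[Frame],
                        speaker: Sequence[Frame]) -> List[Frame]:
    """Alternate semantic and speaker frames: [s0, p0, s1, p1, ...].

    Assumption: frame-by-frame alternation with equal-length streams.
    """
    if len(semantic) != len(speaker):
        raise ValueError("both streams must have the same number of frames")
    merged: List[Frame] = []
    for sem_frame, spk_frame in zip(semantic, speaker):
        if len(sem_frame) != len(spk_frame):
            raise ValueError("frames must share the same embedding dimension")
        merged.append(list(sem_frame))  # semantic (ASR-oriented) frame
        merged.append(list(spk_frame))  # speaker (diarization-oriented) frame
    return merged


if __name__ == "__main__":
    # Toy 2-dimensional features for two time steps.
    semantic = [[0.1, 0.2], [0.3, 0.4]]
    speaker = [[1.0, 0.0], [0.0, 1.0]]
    for frame in interleave_features(semantic, speaker):
        print(frame)
```

Under this reading the merged sequence is twice as long as either stream, and the LLM sees each semantic frame next to the speaker frame for the same time step, as opposed to receiving one stream appended after the other.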

Problem

Research questions and friction points this paper is trying to address.

multi-talker speech recognition

automatic speech recognition

speaker diarization

large language models

speaker attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-encoder architecture

feature interleaving

length-aware speaker ID loss