Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional automatic speech recognition (ASR) systems rely primarily on word error rate (WER) for evaluation, neglecting semantic correctness and lacking human-like interactive correction capabilities. This work proposes an agent-based interactive ASR framework that unifies semantic consistency evaluation and multi-turn interactive error correction within a large language model (LLM)-driven architecture. Specifically, it introduces LLM-as-a-Judge as a semantic-aware evaluator and leverages semantic feedback to iteratively refine recognition outputs. Experimental results on benchmarks including GigaSpeech, WenetSpeech, and ASRU 2019 demonstrate that the proposed approach significantly outperforms existing baselines in both semantic fidelity and interactive correction performance, achieving consistent improvements across both objective and subjective metrics.

📝 Abstract
Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
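The abstract describes a loop in which an LLM judge scores a recognition hypothesis for semantic coherence and its feedback drives iterative correction. The paper does not specify an implementation; the sketch below is a hypothetical illustration of that control flow, where `transcribe`, `judge`, and `refine` are stand-ins for the ASR model, the LLM-as-a-Judge, and the LLM corrector.

```python
def interactive_asr(audio, transcribe, judge, refine, max_turns=3, threshold=0.8):
    """Iteratively refine an ASR hypothesis using semantic feedback from a judge.

    The judge returns a coherence score and free-text feedback; refinement
    stops once the score clears the threshold or the turn budget is spent.
    """
    hypothesis = transcribe(audio)
    for _ in range(max_turns):
        score, feedback = judge(hypothesis)
        if score >= threshold:  # judge deems the output semantically coherent
            break
        hypothesis = refine(hypothesis, feedback)
    return hypothesis


# Toy stand-ins (not the paper's models) to exercise the control flow:
def toy_transcribe(audio):
    return "wreck a nice beach"  # classic homophone misrecognition


def toy_judge(text):
    if text == "wreck a nice beach":
        return 0.2, "phrase is semantically incoherent; check for homophone errors"
    return 0.95, ""


def toy_refine(text, feedback):
    return "recognize speech"  # corrected under the judge's feedback


print(interactive_asr(None, toy_transcribe, toy_judge, toy_refine))
# prints: recognize speech
```

In the paper's framework the judge doubles as the evaluation metric, so the same semantic score that gates the loop can also be reported alongside WER.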
Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition
semantic coherence
interactive correction
evaluation metric
agentic ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive ASR
LLM-as-a-Judge
Semantic Coherence
Agentic Framework
Iterative Refinement
Peng Wang
Nanjing University of Aeronautics and Astronautics; Chinese University of Hong Kong
remote sensing image, machine learning

Yanqiao Zhu
X-LANCE Lab, Shanghai Jiao Tong University

Zixuan Jiang
Xi'an Jiaotong University

Qinyuan Chen
Fudan University

Xingjian Zhao
Fudan University

Xipeng Qiu
Fudan University

Wupeng Wang
Tongyi Fun Team, Alibaba Group

Zhifu Gao
Tongyi Fun Team, Alibaba Group

Xiangang Li
Unknown affiliation
speech recognition, natural language processing

Kai Yu
X-LANCE Lab, Shanghai Jiao Tong University

Xie Chen
Shanghai Jiao Tong University (previously Microsoft and Cambridge University)
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing