A Unified and Reproducible Experimentation Framework for Speech Understanding

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses non-comparable evaluations and hard-to-reproduce training results in speech understanding, which arise from mismatched post-processing, data handling, and pipeline design and which hinder deployment-oriented model selection. To this end, the authors propose SURE, a unified experimentation framework that enables fair evaluation across diverse paradigms, from conventional pipelines to speech large language models, under realistic acoustic and linguistic stressors. SURE achieves this through standardized prediction formats, consistent normalization, and a unified scoring mechanism. It also introduces an agent-assisted training conversion flow that maps papers and their code into versioned, runnable training pipelines. The study presents a unified and reproducible approach to both evaluating and training speech understanding systems across modeling paradigms, improving comparability and reproducibility for deployment-oriented evaluation.
📝 Abstract
Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.
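To make the idea of shared normalization and scoring concrete, here is a minimal sketch in Python. It is not SURE's implementation: the function names `normalize`, `edit_distance`, and `wer`, the normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace), and the choice of word error rate as the metric are all assumptions made for illustration. The point is only that every system's output passes through the same normalizer before the same scorer.

```python
import re
import string


def normalize(text: str) -> str:
    """Hypothetical shared normalizer (an assumption, not SURE's rules):
    lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate after applying the same normalizer to both sides."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Two outputs that differ only in casing and punctuation score identically.
    print(wer("Hello, world!", "hello world"))
    print(wer("the cat sat", "the cat sat down"))
```

Without the shared `normalize` step, the first pair would be scored as two word errors purely because of casing and punctuation, which is the kind of post-processing mismatch the abstract describes.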
Problem

Research questions and friction points this paper is trying to address.

speech understanding
model evaluation
reproducibility
comparability
deployment-oriented selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified evaluation framework
reproducible training pipeline
speech foundation models
agent-assisted conversion
standardized scoring
Jing Peng
Shanghai Jiao Tong University
Automatic Speech Recognition, Speech Large Language Model
Junhao Du
AISpeech Ltd, Suzhou, China
Chenghao Wang
Northeastern University
Robotics
Hanqi Li
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Yi Yang
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Yixuan Wang
AISpeech Ltd, Suzhou, China
Xiaoyu Gu
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Guanyu Chen
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Yucheng Wang
ETH Zürich
Multimodal LLM, Speech Understanding, Human-Computer Interaction
Jiang Li
Hangzhou Dianzi University, Hangzhou, China
Zhangjie Zhao
Hangzhou Dianzi University, Hangzhou, China
Haoran Wang
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Wenming Tu
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Haoyu Li
X-LANCE Lab, Shanghai Jiao Tong University
Speech Recognition, Speech Enhancement, Speech Synthesis
Duo Ma
Chinese University of Hong Kong (Shenzhen)
Speech Recognition, Self-Supervised Learning
Lirong Qian
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Yu Xi
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China
Wen Wen
Shanghai Jiao Tong University
Signal Processing
Jiaqi Guo
AISpeech Ltd, Suzhou, China
Hui Zhang
AISpeech Ltd, Suzhou, China
Shuai Fan
Chengdu University of Technology
Robot Mechanisms
Wenbin Jiang
Hangzhou Dianzi University
Speech Processing, Speech Enhancement, Speech Recognition
Shuai Wang
Nanjing University
AI
Kai Yu
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China