SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Date: 2026-06-01

AI Summary
This work addresses the lack of a unified benchmark for speech editing, which makes it hard to assess multi-attribute manipulation and the preservation of unrelated characteristics at the same time. To bridge this gap, the authors propose a unified evaluation framework supporting multi-attribute editing, composite instructions, and bilingual (Chinese–English) inputs. They construct a benchmark dataset covering seven atomic editing tasks plus composite tasks that combine several operations in one instruction, and introduce an anchor-based contrastive evaluation protocol with three fine-grained metrics: target success, preservation success, and joint success. Combining human and automatic evaluations, they assess leading speech foundation models and specialized systems, revealing three limitations: imbalanced performance across editing dimensions, closed-source models generally outperforming open-source ones, and notably low joint success rates on composite tasks.
Abstract
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code will be released upon acceptance.
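
To make the three metrics concrete, here is a minimal sketch of how they could be aggregated from per-sample judgments. It is an illustration under stated assumptions, not the benchmark's released code: the abstract does not define the metrics formally, so the sketch assumes that a sample counts toward target success when every requested edit is judged successful, toward preservation success when every untargeted attribute is judged unchanged, and toward joint success only when both hold. The `EditResult` structure, the `success_rates` function, and the attribute names ("emotion", "speed", "speaker", "content") are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EditResult:
    """Per-sample judgments (hypothetical structure, not taken from the paper)."""
    target_ok: Dict[str, bool]     # targeted attribute -> was the requested edit achieved?
    preserved_ok: Dict[str, bool]  # untargeted attribute -> was it left unchanged?


def success_rates(results: List[EditResult]) -> Dict[str, float]:
    """Aggregate target, preservation, and joint success rates.

    Assumption: joint success is the per-sample conjunction of target
    success and preservation success.
    """
    if not results:
        raise ValueError("results must not be empty")
    target = preservation = joint = 0
    for r in results:
        t = all(r.target_ok.values())     # every requested edit succeeded
        p = all(r.preserved_ok.values())  # every untargeted attribute kept
        target += t
        preservation += p
        joint += t and p
    n = len(results)
    return {
        "target": target / n,
        "preservation": preservation / n,
        "joint": joint / n,
    }


# Hypothetical judgments: two atomic edits and two compositional edits.
example = [
    EditResult({"emotion": True}, {"speaker": True, "content": True}),
    EditResult({"emotion": True, "speed": False}, {"speaker": True}),
    EditResult({"speed": True}, {"speaker": False, "content": True}),
    EditResult({"emotion": True, "speed": True}, {"speaker": True, "content": True}),
]
print(success_rates(example))
```

Under this assumed definition, joint success can never exceed either of the other two rates, and a compositional instruction fails jointly as soon as one of its operations fails, which is consistent with the reported difficulty of compositional editing.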
Problem

Research questions and friction points this paper is trying to address.

instruction-guided speech editing
speech benchmark
attribute preservation
compositional editing
Speech LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-guided speech editing
SpeechEditBench
multi-attribute benchmark
anchor-based evaluation
compositional editing