StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical challenge of cross-lingual lexical emphasis preservation in speech-to-speech translation (S2ST): source-language emphasis conveys intent and emotion, yet labeled data are scarce and phoneme- or word-level alignment is error-prone. We propose an LLM-driven end-to-end emphasis-preserving framework: (1) leveraging large language models to synthesize high-quality cross-lingual emphasis-aligned parallel data; (2) designing an emphasis-aware controllable TTS module that jointly incorporates ASR-derived prosodic boundaries and LLM-generated emphasis labels; and (3) introducing an LLM-as-Judge paradigm for automated, annotation-free quantification of emphasis fidelity. Experiments demonstrate significant improvements over strong baselines in translation quality, speech naturalness, and intent conveyance. Crucially, the method maintains robust emphasis transfer under low-resource conditions, establishing a scalable, emotion-aware S2ST paradigm.

Technology Category

Application Category

📝 Abstract
We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we developed a pipeline to automatically generate aligned training data and introduce the "LLM-as-Judge" for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.
Problem

Research questions and friction points this paper is trying to address.

Preserving word-level emphasis in speech-to-speech translation across languages
Overcoming data scarcity for stress-aware translation using automated pipelines
Maintaining translation quality while transferring paralinguistic prosodic cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stress-aware S2ST system preserves word-level emphasis
Uses LLMs for cross-lingual emphasis conversion
Automatically generates aligned training data pipeline
🔎 Similar Papers
No similar papers found.
X
Xi Chen
The Chinese University of Hong Kong, China
Y
Yuchen Song
The Chinese University of Hong Kong, China
Satoshi Nakamura
Satoshi Nakamura
The Chinese University of Hong Kong, Shenzhen
speech and natural language processing