What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

📅 2026-06-08
🤖 AI Summary
This study addresses the difficulty of isolating the independent contributions of prosodic cues (speech rate, pitch variation, and loudness) to sarcasm perception, given that these cues co-vary in naturally produced speech. Using neural text-to-speech synthesis with prompt-based prosodic conditioning, the authors construct an orthogonal stimulus set that manipulates each prosodic dimension independently. Human ratings of sarcasm and naturalness are compared with predictions from a foundation model that processes audio input. Human listeners rely primarily on loudness to perceive sarcasm, whereas the model gives greater weight to speech rate, so the two show distinct cue-weighting patterns. The study illustrates how controllable neural TTS can support investigation of prosodic cue weighting in speech perception.
📝 Abstract
Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
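To illustrate what an orthogonal stimulus set means here, the following is a minimal sketch of a full factorial design over the three prosodic dimensions named in the abstract. The number of levels and their coded values (-1, 0, 1) are assumptions made for illustration; the abstract does not state them, and the function names are hypothetical rather than taken from the paper.

```python
from itertools import combinations, product

# Hypothetical coded levels (low / neutral / high) for each prosodic cue.
# The source does not specify how many levels were actually used.
LEVELS = {
    "speech_rate": (-1, 0, 1),
    "pitch_variation": (-1, 0, 1),
    "loudness": (-1, 0, 1),
}


def build_design(levels):
    """Return every combination of cue levels (a full factorial design)."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def is_orthogonal(design, names):
    """Check that every pair of cues has zero covariance across the design."""
    n = len(design)
    for a, b in combinations(names, 2):
        mean_a = sum(row[a] for row in design) / n
        mean_b = sum(row[b] for row in design) / n
        cov = sum((row[a] - mean_a) * (row[b] - mean_b) for row in design)
        if abs(cov) > 1e-9:
            return False
    return True


design = build_design(LEVELS)
print(len(design))                          # number of stimulus conditions
print(is_orthogonal(design, list(LEVELS)))  # cues are mutually uncorrelated
```

Because every level of one cue is paired equally often with every level of the others, the cues are uncorrelated across stimuli, which is what lets each cue's effect on ratings be estimated separately from the rest.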
Problem

Research questions and friction points this paper is trying to address.

sarcasm perception
prosody
speech synthesis
acoustic control
cue weighting
Innovation

Methods, ideas, or system contributions that make the work stand out.

controllable neural TTS
prosody manipulation
sarcasm perception
orthogonal stimulus design
cue weighting