ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of detecting verbal sarcasm from audio alone by modeling temporal prosodic incongruity: the mismatch between local prosodic dynamics and a global, utterance-level emotional baseline. The proposed approach uses a dual-encoder architecture to extract global affective features and fine-grained prosodic features separately, and feeds both to an attention-driven incongruity analyzer for classification. It localizes sarcasm onset without requiring frame-level annotations and uses Monte Carlo Dropout to estimate uncertainty, which serves as a proxy for perceptual ambiguity. The method achieves 75.3% F1 on MUStARD++ and generalizes to spontaneous speech in PodSarc (62.9% F1) and to the cross-lingual MuSaG dataset (65.6% F1). Human evaluation indicates that model-predicted uncertainty tracks subjective ambiguity in sarcasm perception and that predicted onsets align with human-annotated temporal windows.
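To make the Monte Carlo Dropout idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy logistic classifier, the function names (`stochastic_forward`, `mc_dropout_predict`), the dropout rate, and the number of passes are all assumptions chosen for illustration. The point it shows is the general technique: keep dropout active at inference, run several stochastic forward passes, and read uncertainty off the spread of the predictions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stochastic_forward(features, weights, bias, drop_prob, rng):
    """One forward pass of a toy logistic classifier with dropout left ON."""
    keep = 1.0 - drop_prob
    total = bias
    for x, w in zip(features, weights):
        if rng.random() < keep:
            # Inverted-dropout scaling keeps the expected activation unchanged.
            total += (x / keep) * w
    return sigmoid(total)


def mc_dropout_predict(features, weights, bias, drop_prob=0.3, passes=50, seed=0):
    """Return (mean probability, variance, predictive entropy) over stochastic passes."""
    rng = random.Random(seed)
    probs = [
        stochastic_forward(features, weights, bias, drop_prob, rng)
        for _ in range(passes)
    ]
    mean = sum(probs) / passes
    variance = sum((p - mean) ** 2 for p in probs) / passes
    eps = 1e-12  # guards against log(0)
    entropy = -(mean * math.log(mean + eps) + (1.0 - mean) * math.log(1.0 - mean + eps))
    return mean, variance, entropy


# Hypothetical feature vector and weights, for illustration only.
mean, variance, entropy = mc_dropout_predict([1.0, -0.5, 2.0], [0.8, 1.2, -0.4], 0.1)
```

A mean near 0.5 together with a high variance or entropy would flag an utterance the model finds ambiguous, which is the kind of quantity the summary says was compared against human judgments.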

📝 Abstract

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.
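The notion of a scalar incongruity score with attention-based onset localisation can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's Prosodic Incongruity Analyzer: the abstract describes learned encoders, whereas this sketch uses hand-written feature vectors, cosine distance as the mismatch measure, a softmax over per-frame mismatch as the "attention", and the peak-attention frame as the onset. The names `incongruity_and_onset`, `hop_seconds`, and `temperature` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 0.0


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def incongruity_and_onset(frame_feats, global_feat, hop_seconds=0.02, temperature=0.1):
    """Return (scalar incongruity score, onset time in seconds, attention weights)."""
    # Per-frame mismatch between local prosody and the utterance-level baseline.
    per_frame = [cosine_distance(f, global_feat) for f in frame_feats]
    # Attention concentrates on the frames that deviate most from the baseline.
    attention = softmax([d / temperature for d in per_frame])
    # Attention-weighted pooling collapses the sequence into one scalar.
    score = sum(a * d for a, d in zip(attention, per_frame))
    # Assumed onset rule: the frame that receives the most attention.
    onset_frame = max(range(len(attention)), key=attention.__getitem__)
    return score, onset_frame * hop_seconds, attention


# Toy sequence: the third frame departs sharply from the global baseline.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
score, onset, attention = incongruity_and_onset(frames, [1.0, 0.0])
```

Because the onset is read off the attention weights, no frame-level labels are needed at training time, which matches the weakly supervised localisation described in the abstract; how the actual model derives its attention is not specified in this summary.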

Problem

Research questions and friction points this paper is trying to address.

sarcasm recognition

prosody

temporal incongruity

audio-only

emotional baseline

Innovation

Methods, ideas, or system contributions that make the work stand out.

prosodic incongruity

temporal modeling

audio-only sarcasm detection