🤖 AI Summary
This study addresses two gaps in conversational emotion recognition: the lack of systematic understanding of key architectural choices, and the neglect of the pragmatic-linguistic mechanisms underlying emotional expression. Through ablation studies and discourse-marker analysis on the IEMOCAP dataset, we investigate the practical contributions of contextual information, intra-utterance structure, and sentiment lexicons. Our findings reveal that performance saturates within 10–30 dialogue turns, that hierarchical intra-utterance representations become ineffective once context is incorporated, and that sentiment lexicons yield no significant gains. Notably, we find a significant reduction in left-peripheral discourse markers in utterances expressing sadness (21.9% vs. 28–32%, p < 0.0001), suggesting that sadness depends strongly on conversational context. Using only a causal contextual architecture, our model achieves weighted F1 scores of 82.69% (4-class) and 67.07% (6-class), surpassing prior text-only approaches.
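The emotion-by-marker-position association reported above can be tested with a chi-square test of independence. The sketch below uses `scipy.stats.chi2_contingency` on a hypothetical contingency table whose proportions roughly mirror the reported rates (21.9% left-periphery for sadness vs. 28–32% for other emotions); the counts and emotion labels are illustrative, not the study's actual data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts per emotion (rows), assuming 1,000 utterances each:
# columns = (left-periphery marker, marker elsewhere).
counts = [
    [219, 781],  # sad:     ~21.9% left-periphery
    [300, 700],  # happy:   ~30%
    [320, 680],  # angry:   ~32%
    [280, 720],  # neutral: ~28%
]

# Test whether marker position is independent of emotion category.
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```

With proportions of this size and a few thousand observations, the sad row's deficit alone drives the statistic well past the p < 0.0001 threshold, matching the significance level reported in the study.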
📝 Abstract
Despite strong recent progress in Emotion Recognition in Conversation (ERC), two gaps remain: we lack a clear understanding of which modeling choices materially affect performance, and we have limited linguistic analysis linking recognition findings to actionable generation cues. We address both via a systematic study on IEMOCAP. For recognition, we conduct controlled ablations with 10 random seeds and paired tests (corrected for multiple comparisons), yielding three findings. First, conversational context is dominant: performance saturates quickly, with roughly 90% of the gain achieved using only the most recent 10–30 preceding turns. Second, hierarchical sentence representations improve utterance-only recognition (K=0), but the benefit vanishes once turn-level context is available, suggesting that conversational history subsumes intra-utterance structure. Third, integrating an external affective lexicon (SenticNet) does not improve results, consistent with pretrained encoders already capturing affective signal. Under a strictly causal (past-only) setting, our simple models attain strong performance (82.69% 4-way and 67.07% 6-way weighted F1). For the linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < 0.0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28–32%), aligning with accounts that link left-periphery markers to active discourse management. This pattern is consistent with Sad benefiting most from conversational context (+22 percentage points), suggesting that sadness relies more on discourse history than on overt pragmatic signaling.
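The seed-paired testing protocol described above can be sketched as follows: one weighted-F1 score per random seed for each model variant, a paired t-test per ablation against the baseline, and a Holm step-down correction across the ablations. All scores below are illustrative placeholders, not the paper's numbers, and the three variant labels are hypothetical.

```python
from scipy.stats import ttest_rel

def holm_correction(pvals):
    """Holm step-down adjusted p-values for a list of raw p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Adjusted p is the running max of (m - rank) * raw p, capped at 1.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Illustrative weighted-F1 per seed (10 seeds): baseline vs. three variants.
baseline = [81.9, 82.3, 82.1, 82.6, 82.0, 82.4, 82.2, 82.5, 81.8, 82.7]
variants = {
    "context":   [82.8, 83.1, 82.9, 83.4, 82.7, 83.2, 83.0, 83.3, 82.6, 83.5],
    "hierarchy": [81.8, 82.4, 82.0, 82.7, 81.9, 82.5, 82.1, 82.6, 81.7, 82.8],
    "lexicon":   [81.7, 82.2, 82.2, 82.5, 82.1, 82.3, 82.3, 82.4, 81.9, 82.6],
}

# Paired t-test per ablation (same seeds pair the runs), then Holm correction.
raw = [ttest_rel(v, baseline).pvalue for v in variants.values()]
adj = holm_correction(raw)
for name, p in zip(variants, adj):
    print(f"{name}: Holm-adjusted p = {p:.4g}")
```

Pairing by seed removes between-seed variance before testing, and the Holm correction keeps the family-wise error rate controlled across the three ablation comparisons without Bonferroni's full conservatism.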