🤖 AI Summary
This paper introduces the span-level emotion evidence detection task, which aims to precisely localize the textual segments that convey emotion, in contrast with conventional sentence-level emotion classification, and thereby supports applications requiring fine-grained emotional understanding, such as empathetic dialogue systems and clinical decision support. To this end, the authors construct the first manually annotated, multi-level benchmark (span-level labels for both single sentences and five-sentence paragraphs) and systematically evaluate 14 open-source large language models. The key contribution is reframing emotion analysis from *discriminating emotion categories* to *localizing emotion-supporting evidence*, thereby advancing model interpretability and the mechanistic understanding of emotion expression. Experiments reveal that while certain models approach human performance on single-sentence detection, their accuracy degrades markedly on longer contexts, exposing critical limitations, including keyword dependency and high false-positive rates on neutral text. This benchmark establishes a novel evaluation paradigm and foundational infrastructure for explainable emotion computation.
📝 Abstract
We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
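The span-level setup described above implies scoring how closely a model's predicted evidence span matches the human annotation. A minimal sketch of one plausible metric, token-level F1 over the predicted and gold spans (the paper's actual scoring scheme is not specified here; the example spans are illustrative):

```python
from collections import Counter

def token_f1(pred_span: str, gold_span: str) -> float:
    """Token-level F1 between a predicted and a gold evidence span.

    This is a hypothetical evaluation sketch, not the benchmark's
    official metric: spans are lowercased, split on whitespace, and
    scored by multiset token overlap.
    """
    pred = pred_span.lower().split()
    gold = gold_span.lower().split()
    if not pred or not gold:
        # Both empty (e.g., neutral text with no evidence) counts as a match.
        return 1.0 if pred == gold else 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Partial overlap: 2 shared tokens, precision 2/3, recall 2/4.
print(token_f1("so incredibly proud", "incredibly proud of her"))
```

Under this scheme, the failure modes the abstract reports map directly onto the metric: keyword overreliance yields high recall but low precision, while false positives on neutral text score zero against an empty gold span.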