Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation

πŸ“… 2024-10-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
ABSA evaluation is often constrained by single-reference ground truths, which underestimates model performance because aspect/opinion term annotation is highly subjective and admits diverse surface forms. To address this, we propose the first multi-answer, semantics-robust evaluation framework for span extraction tasks. Our method employs an automated pipeline that jointly models semantic equivalence and contextual alignment to generate alternative ground-truth sets, so that multiple syntactically distinct but semantically valid spans can all be matched. We further use Kendall's Tau to quantify agreement with human annotators, mitigating annotation bias. Evaluated across multiple ABSA benchmarks, our framework improves agreement with human judgments by up to 10 percentage points over conventional single-reference paradigms. It more accurately reveals the true capabilities of large language models, demonstrating substantial gains in both reliability and fairness compared to traditional evaluation approaches.
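The summary above uses Kendall's Tau to quantify how well an evaluation protocol agrees with human judgments. A minimal sketch of that measure (Kendall's tau-a, pure Python; the variable names and the pairing of human vs. automatic scores are illustrative assumptions, not the paper's exact setup):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    x and y are paired scores for the same items, e.g. a human
    rating and an automatic metric score per model output."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:       # pair ordered the same way in both lists
            concordant += 1
        elif s < 0:     # pair ordered oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A tau of 1.0 means the two score lists rank every pair identically; -1.0 means they fully disagree, so a 10-percentage-point gain in tau reflects noticeably closer alignment with human rankings.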

πŸ“ Abstract
Aspect-based sentiment analysis (ABSA) is a challenging task of extracting sentiments along with their corresponding aspect and opinion terms from text. The inherent subjectivity of span annotation introduces variability in the surface forms of extracted terms, complicating the evaluation process. Traditional evaluation methods often constrain ground truths (GT) to a single term, potentially misrepresenting the accuracy of semantically valid predictions that differ in surface form. To address this limitation, we propose a novel and fully automated pipeline that expands existing evaluation sets by adding alternative valid terms for aspects and opinions. Our approach facilitates an equitable assessment of language models by accommodating multiple answer candidates, resulting in enhanced human agreement compared to single-answer test sets (achieving up to a 10%p improvement in Kendall's Tau score). Experimental results demonstrate that our expanded evaluation set helps uncover the capabilities of large language models (LLMs) in ABSA tasks that are concealed by single-answer GT sets. Consequently, our work contributes to a flexible evaluation framework for ABSA by embracing diverse surface forms in span extraction tasks in a cost-effective and reproducible manner. Our code and dataset are available at https://github.com/dudrrm/zoom-in-n-out-absa.
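The core scoring change the abstract describes — accepting any of several valid surface forms per gold item — can be sketched as a multi-reference span F1. This is an illustrative implementation under assumed conventions (case-insensitive exact match, each gold item stored as a set of acceptable alternatives; the function name is hypothetical, not from the paper's code):

```python
def multi_ref_f1(predictions, gold_alternatives):
    """Span-extraction P/R/F1 where each gold item is a set of
    acceptable surface forms; a prediction is correct if it
    matches any alternative of some gold item."""
    preds = {p.strip().lower() for p in predictions}
    used_preds = set()   # predictions already credited to a gold item
    matched_gold = 0
    for alts in gold_alternatives:
        norm = {a.strip().lower() for a in alts}
        hit = norm & (preds - used_preds)
        if hit:
            matched_gold += 1
            used_preds.add(next(iter(hit)))
    precision = len(used_preds) / len(preds) if preds else 0.0
    recall = matched_gold / len(gold_alternatives) if gold_alternatives else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, with gold items `[{"battery life", "battery"}, {"screen"}]`, the prediction "battery" is credited even though the single-reference GT might only list "battery life" — exactly the kind of semantically valid, surface-divergent prediction the expanded evaluation set is meant to reward.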
Problem

Research questions and friction points this paper is trying to address.

Addresses variability in aspect-based sentiment analysis evaluation
Proposes automated pipeline for multiple valid term annotations
Enhances evaluation of large language models in ABSA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for evaluation
Expands evaluation with alternative terms
Enhances human agreement in assessment
πŸ”Ž Similar Papers
No similar papers found.