Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation

πŸ“… 2024-10-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
ABSA evaluation is often constrained by single-reference ground truths, which underestimates model performance because aspect/opinion term annotation is highly subjective and admits diverse surface forms. To address this, we propose the first multi-answer, semantics-robust evaluation framework for span extraction tasks. Our method employs an automated pipeline that jointly models semantic equivalence and contextual alignment to generate alternative ground-truth sets, so that multiple syntactically distinct but semantically valid spans can all be matched. We further use Kendall's Tau to quantify agreement with human annotators, mitigating annotation bias. Evaluated across multiple ABSA benchmarks, our framework improves agreement with human judgments by up to 10 percentage points over conventional single-reference paradigms. It more accurately reveals the true capabilities of large language models, demonstrating substantial gains in both reliability and fairness compared to traditional evaluation approaches.
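The summary above uses Kendall's Tau to quantify how well an evaluation protocol agrees with human judgments. A minimal sketch of that measure (Kendall's tau-a, pure Python; the variable names and the pairing of human vs. automatic scores are illustrative assumptions, not the paper's exact setup):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    x and y are paired scores for the same items, e.g. a human
    rating and an automatic metric score per model output."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:       # pair ordered the same way in both lists
            concordant += 1
        elif s < 0:     # pair ordered oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A tau of 1.0 means the two score lists rank every pair identically; -1.0 means they fully disagree, so a 10-percentage-point gain in tau reflects noticeably closer alignment with human rankings.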

πŸ“ Abstract
Aspect-based sentiment analysis (ABSA) is a challenging task of extracting sentiments along with their corresponding aspect and opinion terms from text. The inherent subjectivity of span annotation introduces variability in the surface forms of extracted terms, complicating the evaluation process. Traditional evaluation methods often constrain ground truths (GT) to a single term, potentially misrepresenting the accuracy of semantically valid predictions that differ in surface form. To address this limitation, we propose a novel and fully automated pipeline that expands existing evaluation sets by adding alternative valid terms for aspects and opinions. Our approach facilitates an equitable assessment of language models by accommodating multiple answer candidates, resulting in enhanced human agreement compared to single-answer test sets (achieving up to a 10%p improvement in Kendall's Tau score). Experimental results demonstrate that our expanded evaluation set helps uncover the capabilities of large language models (LLMs) in ABSA tasks that are concealed by single-answer GT sets. Consequently, our work contributes to a flexible evaluation framework for ABSA by embracing diverse surface forms in span extraction tasks in a cost-effective and reproducible manner. Our code and dataset are available at https://github.com/dudrrm/zoom-in-n-out-absa.
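The core scoring change the abstract describes — accepting any of several valid surface forms per gold item — can be sketched as a multi-reference span F1. This is an illustrative implementation under assumed conventions (case-insensitive exact match, each gold item stored as a set of acceptable alternatives; the function name is hypothetical, not from the paper's code):

```python
def multi_ref_f1(predictions, gold_alternatives):
    """Span-extraction P/R/F1 where each gold item is a set of
    acceptable surface forms; a prediction is correct if it
    matches any alternative of some gold item."""
    preds = {p.strip().lower() for p in predictions}
    used_preds = set()   # predictions already credited to a gold item
    matched_gold = 0
    for alts in gold_alternatives:
        norm = {a.strip().lower() for a in alts}
        hit = norm & (preds - used_preds)
        if hit:
            matched_gold += 1
            used_preds.add(next(iter(hit)))
    precision = len(used_preds) / len(preds) if preds else 0.0
    recall = matched_gold / len(gold_alternatives) if gold_alternatives else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, with gold items `[{"battery life", "battery"}, {"screen"}]`, the prediction "battery" is credited even though the single-reference GT might only list "battery life" — exactly the kind of semantically valid, surface-divergent prediction the expanded evaluation set is meant to reward.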
Problem

Research questions and friction points this paper is trying to address.

Addresses variability in aspect-based sentiment analysis evaluation
Proposes automated pipeline for multiple valid term annotations
Enhances evaluation of large language models in ABSA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for evaluation
Expands evaluation with alternative terms
Enhances human agreement in assessment
πŸ”Ž Similar Papers
No similar papers found.