🤖 AI Summary
Current event extraction evaluation relies on token-level exact matching, erroneously penalizing semantically correct yet syntactically divergent predictions and thus severely underestimating model capabilities. To address this, we propose RAEE, a novel semantic-level evaluation framework for event extraction. RAEE employs large language models as intelligent evaluators, leveraging semantic alignment modeling and adaptive prompting to enable fine-grained, interpretable assessment of triggers and arguments along both the precision and recall dimensions. Further, it incorporates multi-granularity human alignment calibration, achieving strong agreement with human judgments (Spearman’s ρ > 0.92). We re-evaluate 14 state-of-the-art models across 10 benchmark datasets, observing consistent F1 gains of 15–40% over exact-match scores. The open-sourced, reproducible RAEE toolkit advances event extraction evaluation from superficial surface-form matching toward semantically grounded, trustworthy assessment.
📝 Abstract
Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantically correct cases. This reliance leads to a significant discrepancy between models' evaluated performance under exact-match criteria and their real performance. To address this problem, we propose a reliable and semantic evaluation framework for event extraction, named RAEE, which accurately assesses extraction results at the semantic level instead of the token level. Specifically, RAEE leverages large language models (LLMs) as evaluation agents, incorporating an adaptive mechanism to evaluate the precision and recall of triggers and arguments. Extensive experiments demonstrate that: (1) RAEE achieves a very strong correlation with human judgments; (2) reassessing 14 models, including advanced LLMs, on 10 datasets reveals a significant performance gap between exact match and RAEE: exact-match evaluation significantly underestimates the performance of existing event extraction models, and in particular the capabilities of LLMs; (3) fine-grained analysis under RAEE evaluation reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE is publicly released.
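The gap the abstract describes can be made concrete with a toy example. The sketch below scores the same trigger predictions under token-level exact match and under a pluggable semantic matcher; the `semantic_match` function here is only a stand-in for RAEE's LLM evaluation agent (a simple containment heuristic, not the paper's method), and the greedy one-to-one matching is an illustrative assumption.

```python
def f1(preds, golds, match):
    """Precision/recall/F1 with greedy one-to-one matching of predicted
    spans against gold spans, under an arbitrary match predicate."""
    unmatched = list(golds)
    tp = 0
    for p in preds:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)  # each gold span may be matched once
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(golds) if golds else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Token-level exact match: the current mainstream criterion.
exact_match = lambda p, g: p == g

def semantic_match(p, g):
    # Placeholder for an LLM judge: here we simply treat one span
    # containing the other (case-insensitively) as equivalent.
    p, g = p.lower(), g.lower()
    return p == g or p in g or g in p

gold = ["was arrested", "the bombing"]
pred = ["arrested", "bombing"]  # semantically correct, lexically divergent

print(f1(pred, gold, exact_match))     # 0.0 — exact match rejects both
print(f1(pred, gold, semantic_match))  # 1.0 — semantic matching accepts both
```

Swapping the `match` predicate is the whole point: the model's predictions are unchanged, yet the reported F1 moves from 0 to 1, which is the kind of underestimation the paper quantifies at scale.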