🤖 AI Summary
Current event extraction evaluation relies on token-level exact matching, erroneously penalizing semantically correct yet syntactically divergent predictions and thus severely underestimating model capabilities. To address this, we propose RAEE, a novel semantic-level evaluation framework for event extraction. RAEE employs large language models as intelligent evaluators, leveraging semantic alignment modeling and adaptive prompting to enable fine-grained, interpretable assessment of triggers and arguments along both the precision and recall dimensions. Further, it incorporates multi-granularity human alignment calibration, achieving strong agreement with human judgments (Spearman’s ρ > 0.92). We re-evaluate 14 state-of-the-art models across 10 benchmark datasets, observing consistent F1 gains of 15–40% over exact-match scores. The open-sourced, reproducible RAEE toolkit advances event extraction evaluation from superficial surface-form matching toward semantically grounded, trustworthy assessment.
📝 Abstract
Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantically correct cases. This reliance leads to a significant discrepancy between models' evaluated performance under exact-match criteria and their real performance. To address this problem, we propose a reliable and semantic evaluation framework for event extraction, named RAEE, which accurately assesses extraction results at the semantic level instead of the token level. Specifically, RAEE leverages large language models (LLMs) as evaluation agents, incorporating an adaptive mechanism to evaluate the precision and recall of triggers and arguments. Extensive experiments demonstrate that: (1) RAEE achieves a very strong correlation with human judgments; (2) reassessing 14 models, including advanced LLMs, on 10 datasets reveals a significant performance gap between exact match and RAEE: exact-match evaluation significantly underestimates the performance of existing event extraction models, and in particular the capabilities of LLMs; (3) fine-grained analysis under RAEE evaluation reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE is publicly released.
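The gap the abstract describes can be made concrete with a toy example. The sketch below scores the same trigger predictions under token-level exact match and under a pluggable semantic matcher; the `semantic_match` function here is only a stand-in for RAEE's LLM evaluation agent (a simple containment heuristic, not the paper's method), and the greedy one-to-one matching is an illustrative assumption.

```python
def f1(preds, golds, match):
    """Precision/recall/F1 with greedy one-to-one matching of predicted
    spans against gold spans, under an arbitrary match predicate."""
    unmatched = list(golds)
    tp = 0
    for p in preds:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)  # each gold span may be matched once
                tp += 1
                break
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(golds) if golds else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Token-level exact match: the current mainstream criterion.
exact_match = lambda p, g: p == g

def semantic_match(p, g):
    # Placeholder for an LLM judge: here we simply treat one span
    # containing the other (case-insensitively) as equivalent.
    p, g = p.lower(), g.lower()
    return p == g or p in g or g in p

gold = ["was arrested", "the bombing"]
pred = ["arrested", "bombing"]  # semantically correct, lexically divergent

print(f1(pred, gold, exact_match))     # 0.0 — exact match rejects both
print(f1(pred, gold, semantic_match))  # 1.0 — semantic matching accepts both
```

Swapping the `match` predicate is the whole point: the model's predictions are unchanged, yet the reported F1 moves from 0 to 1, which is the kind of underestimation the paper quantifies at scale.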