REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction

📅 2025-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional exact match (EM) evaluation is ill-suited to generative models in event argument extraction because it rejects semantically equivalent surface variants, implicit arguments (inferable but unstated), and scattered arguments (distributed across a document). This work proposes a human-aligned evaluation framework tailored to generative models. It integrates semantic similarity computation, logical consistency verification, multi-granularity argument normalization, and an iterative human calibration feedback mechanism, so that implicit and cross-sentence arguments can be credited. Across six benchmark datasets, model performance measured with the framework is on average 23.93 F1 points higher than under EM. In human validation, the framework's judgments of argument correctness align with human assessments 87.67% of the time.

📝 Abstract

Event argument extraction identifies arguments for predefined event roles in text. Traditional evaluations rely on exact match (EM), requiring predicted arguments to match annotated spans exactly. However, this approach fails for generative models like large language models (LLMs), which produce diverse yet semantically accurate responses. EM underestimates performance by disregarding valid variations, implicit arguments (unstated but inferable), and scattered arguments (distributed across a document). To bridge this gap, we introduce Reliable Evaluation framework for Generative event argument extraction (REGen), a framework that better aligns with human judgment. Across six datasets, REGen improves performance by an average of 23.93 F1 points over EM. Human validation further confirms REGen's effectiveness, achieving 87.67% alignment with human assessments of argument correctness.
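To make the gap concrete, the sketch below scores one toy prediction twice: once with exact match and once with a relaxed matcher. It is only an illustration under stated assumptions. The helper names (`normalize`, `f1`, `exact_match`, `relaxed_match`) and the example arguments are hypothetical, and the simple text normalization is a stand-in for a more lenient judgment, not REGen's actual procedure.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def exact_match(pred, gold):
    """EM: role and argument string must be identical."""
    return pred == gold

def relaxed_match(pred, gold):
    """Hypothetical relaxed matcher: same role, same normalized text."""
    return pred[0] == gold[0] and normalize(pred[1]) == normalize(gold[1])

def f1(predicted, gold, match):
    """Argument-level F1, where `match` decides whether a prediction is correct."""
    if not predicted or not gold:
        return 0.0
    used = set()          # gold arguments already credited
    true_positives = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i not in used and match(p, g):
                used.add(i)
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy (role, argument) pairs, invented for illustration only.
gold = [("Attacker", "the rebels"), ("Place", "Baghdad")]
predicted = [("Attacker", "rebels"), ("Place", "Baghdad")]

em_f1 = f1(predicted, gold, exact_match)
relaxed_f1 = f1(predicted, gold, relaxed_match)
print(em_f1, relaxed_f1)
```

Under EM the prediction "rebels" is counted as wrong because the annotated span is "the rebels", so the same output scores lower than it does with the relaxed matcher. Implicit and scattered arguments would need more than string normalization, which this sketch does not attempt.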

Problem

Research questions and friction points this paper is trying to address.

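Exact match underestimates the performance of generative event argument extraction models.

Semantically valid variants, implicit arguments, and scattered arguments are scored as errors under exact match.

Evaluation should align with human judgment of argument correctness.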

Innovation

Methods, ideas, or system contributions that make the work stand out.

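REGen, an evaluation framework for generative event argument extraction

87.67% alignment with human assessments of argument correctness

Measured performance averaging 23.93 F1 points higher than under EM across six datasets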
