Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses limitations in evaluating models for safety-critical NLP tasks by proposing expert agreement—not mere accuracy—as the core evaluation metric. Focusing on traffic accident narrative classification, we systematically benchmark BERT variants, Universal Sentence Encoder (USE), zero-shot classifiers, and LLMs (GPT-4, LLaMA-3, Qwen, Claude), quantifying human–model alignment via Cohen’s Kappa. We further interpret model decisions using PCA and SHAP. Results reveal a critical trade-off: high-accuracy models often rely on positional keywords, diverging from expert reasoning; conversely, LLMs—though lower in absolute accuracy—exhibit superior consistency with experts in temporal reasoning and contextual modeling. This work establishes expert agreement as a foundational dimension for safety-sensitive NLP evaluation, uncovering an inherent tension between accuracy and reliability in real-world deployment.

📝 Abstract
This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
Problem

Research questions and friction points this paper is trying to address.

Examines the tension between DL model accuracy and expert agreement in crash narrative classification
Assesses expert alignment of LLMs despite lower accuracy scores
Proposes expert agreement as a metric for safety-critical NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates DL models using expert-labeled data
Employs Cohen's Kappa, PCA, and SHAP techniques
Advocates expert agreement as a complement to accuracy metrics
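The agreement analysis above centers on Cohen's Kappa, which corrects raw label agreement for the agreement expected by chance. The sketch below implements the standard formula κ = (p_o − p_e) / (1 − p_e) in plain Python; the crash-type labels are hypothetical placeholders, not drawn from the paper's dataset, and the paper itself does not publish its evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    # Observed agreement p_o: fraction of items where the raters match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement p_e: chance overlap of the two label marginals
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical expert vs. model crash-type labels (illustration only)
expert = ["rear-end", "angle", "rear-end", "sideswipe", "angle", "rear-end"]
model  = ["rear-end", "angle", "angle",    "sideswipe", "angle", "rear-end"]

print(round(cohens_kappa(expert, model), 3))  # → 0.739
```

A kappa near 1 indicates near-perfect chance-corrected agreement, while values near 0 mean the model matches experts no better than chance; this is why the paper treats kappa, rather than raw accuracy, as the signal of expert alignment.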