🤖 AI Summary
This study addresses limitations in evaluating models for safety-critical NLP tasks by proposing expert agreement, rather than accuracy alone, as a core evaluation metric. Focusing on traffic crash narrative classification, we systematically benchmark BERT variants, the Universal Sentence Encoder (USE), a zero-shot classifier, and LLMs (GPT-4, LLaMA 3, Qwen, Claude), quantifying human–model alignment with Cohen's Kappa and interpreting model decisions with PCA and SHAP. Results reveal a critical trade-off: high-accuracy models often rely on location-specific keywords and diverge from expert reasoning, whereas LLMs, despite lower absolute accuracy, align more closely with experts through temporal reasoning and contextual modeling. This work establishes expert agreement as a foundational dimension of safety-sensitive NLP evaluation, exposing an inherent tension between accuracy and reliability in real-world deployment.
📝 Abstract
This study examines the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- on expert-labeled crash narrative text, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models rely more on contextual and temporal language cues than on location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
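The abstract's central metric, Cohen's Kappa, measures agreement between two label sets corrected for the agreement expected by chance. A minimal sketch of the computation (the labels below are hypothetical, not data from the study):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical expert vs. model labels for six crash narratives
expert = ["injury", "no_injury", "injury", "injury", "no_injury", "injury"]
model  = ["injury", "no_injury", "no_injury", "injury", "no_injury", "injury"]
print(round(cohen_kappa(expert, model), 4))  # 0.6667
```

Here the raters agree on 5 of 6 narratives (p_o ≈ 0.833), but because both label "injury" frequently, chance agreement is high (p_e = 0.5), so kappa lands at 0.667 rather than 0.833. This chance correction is why the paper prefers kappa over raw accuracy for human–model alignment.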