🤖 AI Summary
This study addresses limitations in evaluating models for safety-critical NLP tasks by proposing expert agreement, rather than accuracy alone, as a core evaluation metric. Focusing on traffic crash narrative classification, we systematically benchmark BERT variants, the Universal Sentence Encoder (USE), a zero-shot classifier, and LLMs (GPT-4, LLaMA 3, Qwen, Claude), quantifying human–model alignment with Cohen's Kappa and interpreting model decisions with PCA and SHAP. Results reveal a critical trade-off: high-accuracy models often rely on location-specific keywords and diverge from expert reasoning, whereas LLMs, despite lower absolute accuracy, align more closely with experts through temporal reasoning and contextual modeling. This work establishes expert agreement as a foundational dimension of safety-sensitive NLP evaluation, exposing an inherent tension between accuracy and reliability in real-world deployment.
📝 Abstract
This study examines the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models -- including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier -- on expert-labeled crash narrative text, and extend the analysis to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen's Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models rely more on contextual and temporal language cues than on location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
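The abstract's central metric, Cohen's Kappa, measures agreement between two label sets corrected for the agreement expected by chance. A minimal sketch of the computation (the labels below are hypothetical, not data from the study):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected from each rater's label frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical expert vs. model labels for six crash narratives
expert = ["injury", "no_injury", "injury", "injury", "no_injury", "injury"]
model  = ["injury", "no_injury", "no_injury", "injury", "no_injury", "injury"]
print(round(cohen_kappa(expert, model), 4))  # 0.6667
```

Here the raters agree on 5 of 6 narratives (p_o ≈ 0.833), but because both label "injury" frequently, chance agreement is high (p_e = 0.5), so kappa lands at 0.667 rather than 0.833. This chance correction is why the paper prefers kappa over raw accuracy for human–model alignment.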