Inference Gap in Domain Expertise and Machine Intelligence in Named Entity Recognition: Creation of and Insights from a Substance Use-related Dataset

📅 2025-08-26

🤖 AI Summary

This study addresses the public health challenge of nonmedical opioid use by framing the extraction of self-reported consequences from social media as a named entity recognition (NER) task over two categories: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). It introduces RedditImpacts 2.0, a high-quality, expert-annotated dataset with refined annotation guidelines and a focus on first-person disclosures. The study compares fine-tuned encoder-based models, including DeBERTa-large, against large language models (LLMs) under zero- and few-shot in-context learning settings. DeBERTa-large achieves a relaxed token-level F1 score of 0.61 and outperforms the LLMs in precision, span accuracy, and adherence to annotation guidelines. Strong NER performance is also achievable with substantially less labeled data. However, the best model still falls well short of inter-expert agreement (Cohen's kappa = 0.81), which indicates a persistent gap between domain experts and current models on tasks that require deep domain knowledge.
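For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's kappa for two annotators who label the same sequence of items. The function name `cohens_kappa` and the toy labels are illustrative assumptions; the paper's actual annotation units and agreement procedure are not described in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token labels from two annotators (not data from the paper).
annotator_1 = ["O", "O", "ClinicalImpacts", "SocialImpacts"]
annotator_2 = ["O", "O", "ClinicalImpacts", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, which matters in NER because most tokens fall outside any entity and two annotators would agree on many of them by default.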

📝 Abstract

Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. The best performing model, however, still significantly underperforms compared to inter-expert agreement (Cohen's kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.
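The abstract reports a relaxed F1 without defining it here. The sketch below shows one common relaxed-matching scheme, in which a predicted span counts as correct if it overlaps a gold span of the same entity type by at least one token. This matching rule, the `(start, end, label)` span format with an exclusive end index, and the function names are assumptions for illustration and may differ from the paper's exact metric.

```python
def overlaps(span_a, span_b):
    """True if two (start, end, label) spans share a label and a token.

    Token indices are half-open: `end` is exclusive.
    """
    same_label = span_a[2] == span_b[2]
    return same_label and span_a[0] < span_b[1] and span_b[0] < span_a[1]


def relaxed_f1(gold, pred):
    """Relaxed F1: partial overlap with a same-type span counts as a match."""
    # Predicted spans that touch at least one gold span of the same type.
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # Gold spans that are touched by at least one predicted span.
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans over one tokenized post (not data from the paper).
gold_spans = [(2, 5, "ClinicalImpacts"), (8, 10, "SocialImpacts")]
pred_spans = [
    (3, 5, "ClinicalImpacts"),    # partial overlap, same type: match
    (8, 10, "ClinicalImpacts"),   # same tokens, wrong type: no match
    (12, 13, "SocialImpacts"),    # no gold span here: no match
]
score = relaxed_f1(gold_spans, pred_spans)
```

Under this scheme a model is not penalized for slightly misplaced span boundaries, which is why relaxed scores are typically higher than strict exact-match scores on the same predictions.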

Problem

Research questions and friction points this paper is trying to address.

Extracting clinical and social impacts from opioid-related social media narratives

Addressing the inference gap between domain expertise and machine intelligence

Evaluating fine-tuned encoder models and LLMs on NER over a substance use-related dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned DeBERTa-large model for NER

RedditImpacts 2.0 dataset with refined annotations

Strong NER performance with substantially less labeled training data
