Inference Gap in Domain Expertise and Machine Intelligence in Named Entity Recognition: Creation of and Insights from a Substance Use-related Dataset

Date: 2025-08-26

AI Summary
This study addresses the public health challenge of nonmedical opioid use by framing the extraction of self-reported consequences from social media as a named entity recognition (NER) task over two categories: clinical impacts and social impacts. It introduces RedditImpacts 2.0, a high-quality, expert-annotated dataset with refined annotation guidelines and a focus on first-person disclosures. The study compares fine-tuned encoder-based models with large language models (LLMs) under zero- and few-shot in-context learning settings. The fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 and outperforms the LLMs in precision, span accuracy, and adherence to annotation guidelines. The results also show that strong NER performance is achievable with substantially less labeled data. The best model nevertheless falls well short of inter-expert agreement (Cohen's kappa = 0.81), which indicates a persistent gap between domain experts and current AI models on tasks that require deep domain knowledge.

Abstract
Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. The best performing model, however, still significantly underperforms compared to inter-expert agreement (Cohen's kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.
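The headline metric in the abstract is a relaxed token-level F1. The summary does not define "relaxed", so the sketch below rests on one plausible reading that is an assumption here, not the authors' stated procedure: a predicted token counts as correct when its entity type matches the gold type, regardless of the B-/I- boundary prefix. The function name `relaxed_token_f1`, the BIO tag strings, and the toy sequences are all hypothetical.

```python
from typing import List, Tuple


def strip_bio(tag: str) -> str:
    """Drop a B-/I- prefix so only the entity type remains ('O' stays 'O')."""
    return tag.split("-", 1)[1] if "-" in tag else tag


def relaxed_token_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 over entity types.

    ASSUMPTION: 'relaxed' is taken to mean that B-/I- boundaries are ignored;
    the paper may define it differently.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    gold_types = [strip_bio(t) for t in gold]
    pred_types = [strip_bio(t) for t in pred]

    # A true positive is a non-O token whose predicted type equals the gold type.
    tp = sum(1 for g, p in zip(gold_types, pred_types) if g != "O" and g == p)
    pred_pos = sum(1 for p in pred_types if p != "O")
    gold_pos = sum(1 for g in gold_types if g != "O")

    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the prediction misses the first token of the SocialImpacts span
# and uses I- instead of B- at the start of the ClinicalImpacts span.
gold_tags = ["O", "B-ClinicalImpacts", "I-ClinicalImpacts", "O",
             "B-SocialImpacts", "I-SocialImpacts"]
pred_tags = ["O", "I-ClinicalImpacts", "I-ClinicalImpacts", "O",
             "O", "B-SocialImpacts"]

precision, recall, f1 = relaxed_token_f1(gold_tags, pred_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading the boundary mistake is not penalized, while the missed token lowers recall. A strict span-level metric would score the same prediction much lower.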
Problem

Research questions and friction points this paper is trying to address.

Extracting clinical and social impacts from opioid-related social media narratives
Addressing the inference gap between domain expertise and machine intelligence
Evaluating NER models for substance use-related dataset annotation
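The gap named in the second point is measured against inter-expert agreement, which the abstract reports as Cohen's kappa of 0.81. As a rough illustration of that statistic, here is a minimal sketch of Cohen's kappa for two annotators who label the same tokens. The function name `cohens_kappa`, the short labels, and the toy annotations are hypothetical. The unit of comparison (tokens, spans, or posts) is also an assumption, since the summary does not say which one the authors used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label everywhere; treat as perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy token labels: C = ClinicalImpacts, S = SocialImpacts, O = outside.
expert_1 = ["C", "C", "O", "O", "S", "O", "C", "O"]
expert_2 = ["C", "O", "O", "O", "S", "O", "C", "S"]

print(f"kappa={cohens_kappa(expert_1, expert_2):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is not on the same scale as F1. Comparing a model's F1 with a kappa value is therefore indicative of a gap rather than a like-for-like measurement.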
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned DeBERTa-large model for NER
RedditImpacts 2.0 dataset with refined annotations
Evidence that strong NER performance is achievable with substantially less labeled data