AI Summary
This study addresses the public health challenge of nonmedical opioid use (NMU) through named entity recognition (NER) of two categories of self-reported consequences on social media: clinical and social impacts. We introduce RedditImpacts 2.0, a high-quality, expert-annotated dataset that focuses on first-person disclosures and uses refined annotation guidelines. We compare a fine-tuned DeBERTa-large model against large language models (LLMs) under zero- and few-shot in-context learning settings. DeBERTa-large achieves a relaxed token-level F1 score of 0.61 and outperforms LLMs in precision, span accuracy, and adherence to annotation guidelines. The results indicate that strong NER performance can be achieved with substantially less labeled training data. However, the best model still falls well short of inter-expert agreement (Cohen's kappa = 0.81); this persistent human-AI gap underscores the role of deep domain expertise in building trustworthy AI models for sensitive health domains.
Abstract
Nonmedical opioid use is an urgent public health challenge, with far-reaching clinical and social consequences that are often underreported in traditional healthcare settings. Social media platforms, where individuals candidly share first-person experiences, offer a valuable yet underutilized source of insight into these impacts. In this study, we present a named entity recognition (NER) framework to extract two categories of self-reported consequences from social media narratives related to opioid use: ClinicalImpacts (e.g., withdrawal, depression) and SocialImpacts (e.g., job loss). To support this task, we introduce RedditImpacts 2.0, a high-quality dataset with refined annotation guidelines and a focus on first-person disclosures, addressing key limitations of prior work. We evaluate both fine-tuned encoder-based models and state-of-the-art large language models (LLMs) under zero- and few-shot in-context learning settings. Our fine-tuned DeBERTa-large model achieves a relaxed token-level F1 of 0.61 [95% CI: 0.43-0.62], consistently outperforming LLMs in precision, span accuracy, and adherence to task-specific guidelines. Furthermore, we show that strong NER performance can be achieved with substantially less labeled data, emphasizing the feasibility of deploying robust models in resource-limited settings. Our findings underscore the value of domain-specific fine-tuning for clinical NLP tasks and contribute to the responsible development of AI tools that may enhance addiction surveillance, improve interpretability, and support real-world healthcare decision-making. The best-performing model, however, still significantly underperforms relative to inter-expert agreement (Cohen's kappa: 0.81), demonstrating that a gap persists between expert intelligence and current state-of-the-art NER/AI capabilities for tasks requiring deep domain knowledge.
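The two headline numbers, a relaxed F1 for the model and Cohen's kappa for the experts, can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. The function names `cohen_kappa` and `relaxed_f1`, the `(start, end, label)` span format, and the toy inputs are all assumptions made for illustration. In particular, "relaxed" is taken here to mean that a predicted span counts as correct when it overlaps a gold span of the same type; the exact token-level scoring used in the study may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def relaxed_f1(gold_spans, pred_spans):
    """Overlap-based F1 (assumed definition of 'relaxed').

    Spans are (start, end, label) tuples with an exclusive end offset.
    """
    def overlaps(s, t):
        return s[2] == t[2] and s[0] < t[1] and t[0] < s[1]

    # A prediction is correct if it overlaps any gold span of the same label.
    hit_pred = sum(any(overlaps(p, g) for g in gold_spans) for p in pred_spans)
    # A gold span is recovered if any prediction of the same label overlaps it.
    hit_gold = sum(any(overlaps(g, p) for p in pred_spans) for g in gold_spans)
    precision = hit_pred / len(pred_spans) if pred_spans else 0.0
    recall = hit_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy inputs (hypothetical, not drawn from RedditImpacts 2.0).
annotator_1 = ["O", "ClinicalImpacts", "ClinicalImpacts", "O"]
annotator_2 = ["O", "ClinicalImpacts", "O", "O"]
gold = [(0, 3, "ClinicalImpacts"), (5, 7, "SocialImpacts")]
pred = [(1, 3, "ClinicalImpacts"), (8, 9, "SocialImpacts")]

kappa = cohen_kappa(annotator_1, annotator_2)
f1 = relaxed_f1(gold, pred)
```

Note that kappa and F1 sit on different scales (kappa corrects for chance agreement, F1 does not), so a comparison such as 0.61 against 0.81 is best read as indicative of a gap rather than as a like-for-like difference.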