LLM Assistance for Pediatric Depression

📅 2025-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Depression screening in pediatric clinical settings (ages 6–24) remains challenging due to low feasibility of standard tools like PHQ-9. Method: We propose a low-resource, zero-shot large language model (LLM)-driven symptom extraction framework that leverages off-the-shelf LLMs (Flan-T5, Phi-3, Llama 3) to identify depressive symptoms directly from unstructured clinical notes without fine-tuning or labeled data. Crucially, we treat LLM-generated symptom annotations as structured features for downstream classification, eliminating reliance on large annotated corpora or high computational resources. Contribution/Results: This is the first systematic zero-shot evaluation of LLMs for depression symptom identification in pediatric clinical text. Flan-T5 achieves a symptom-level F1-score of 0.65 (0.92 for the rare symptom “sleep problems”), and its extracted features boost classifier accuracy to 0.78—significantly outperforming baselines. The approach offers a deployable, interpretable, and resource-efficient decision-support paradigm for early pediatric depression detection.

📝 Abstract
Traditional depression screening methods, such as the PHQ-9, are particularly challenging for children in pediatric primary care due to practical limitations. AI has the potential to help, but the scarcity of annotated datasets in mental health, combined with the computational costs of training, highlights the need for efficient, zero-shot approaches. In this work, we investigate the feasibility of state-of-the-art LLMs for depressive symptom extraction in pediatric settings (ages 6-24). This approach aims to complement traditional screening and minimize diagnostic errors. Our findings show that all LLMs are 60% more efficient than word matching, with Flan-T5 leading in precision (average F1: 0.65, precision: 0.78) and excelling at extracting rarer symptoms such as "sleep problems" (F1: 0.92) and "self-loathing" (F1: 0.8). Phi-3 strikes a balance between precision (0.44) and recall (0.60), performing well in categories like "Feeling depressed" (0.69) and "Weight change" (0.78). Llama 3, with the highest recall (0.90), overgeneralizes symptoms, making it less suitable for this type of analysis. The main challenges faced by the LLMs are navigating the complex structure of clinical notes, which mix content from different points in the patient trajectory, and misinterpreting elevated PHQ-9 scores. Finally, we demonstrate the utility of the symptom annotations provided by Flan-T5 as features in an ML algorithm, which differentiates depression cases from controls with a high precision of 0.78, a major performance boost over a baseline that does not use these features.
Problem

Research questions and friction points this paper is trying to address.

Depression Screening
Pediatric Practice
Machine Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Depression Screening
Machine Learning Algorithm Enhancement
Mariia Ignashina
Queen Mary University of London, School of Electronic Engineering and Computer Science, London, UK.
Paulina Bondaronek
University College London
digital health, evaluation, natural language processing
D. Santel
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, University of Cincinnati, Cincinnati, OH, USA.
John P. Pestian
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, University of Cincinnati, Cincinnati, OH, USA.
Julia Ive
University College London