Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

📅 2025-06-02
🤖 AI Summary
Crisis detection in mental health helplines is a highly sensitive, emotion-intensive task requiring fine-grained, multi-faceted assessment. Method: We introduce PsyCrisisBench—the first multi-task, fine-grained evaluation benchmark tailored to real-world helpline transcripts (540 human-annotated utterances), covering emotion recognition, suicidal ideation/plan detection, and risk assessment. We systematically evaluate 64 large language models (LLMs) under standardized protocols. Results: (1) Small-parameter open-weight models (e.g., Qwen2.5-1.5B), after supervised fine-tuning, outperform larger models on emotion and suicidal ideation tasks; (2) AWQ quantization reduces GPU memory usage substantially with negligible performance degradation (best F1 = 0.907); (3) Open-weight LLMs achieve overall performance comparable to closed-weight counterparts, with only emotion recognition showing a statistically significant gap (p = 0.007). This work provides empirical foundations and open resources for lightweight, trustworthy, and deployable LLMs in crisis intervention.
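The memory saving from AWQ (4-bit weight-only) quantization can be sanity-checked with simple arithmetic. The parameter count below is an illustrative assumption (a 32B model in the spirit of QwQ-32B), not a figure from the paper, and weight-only accounting ignores activations and the KV cache:

```python
# Back-of-the-envelope estimate of weight memory under FP16 vs. 4-bit
# quantization; parameter count is a hypothetical example.

def weight_memory_gb(n_params, bits_per_weight):
    """Memory for model weights alone, ignoring activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1024**3

n = 32e9  # illustrative: a 32B-parameter model
fp16 = weight_memory_gb(n, 16)
awq4 = weight_memory_gb(n, 4)
print(f"FP16: {fp16:.1f} GB, 4-bit: {awq4:.1f} GB, saving: {1 - awq4 / fp16:.0%}")
```

The idealized saving is 75%; real AWQ deployments keep some tensors at higher precision, which is consistent with the ~70% reduction reported in the abstract.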

📝 Abstract
Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch's t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), with further gains from few-shot prompting and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely due to lost vocal cues and ambiguity. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood and suicidal ideation tasks. Open-source models like QwQ-32B performed comparably to closed-source models on most tasks (p>0.3), though closed models retained an edge in mood detection (p=0.007). Performance scaled with model size up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood recognition remains limited by contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests that real-world integration is feasible. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.
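The evaluation protocol pairs per-task F1-scores with Welch's t-tests for comparing model groups. A minimal, dependency-free sketch of both measures (the labels below are hypothetical examples, not PsyCrisisBench data):

```python
import math

def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def welch_t(a, b):
    """Welch's t-statistic: two-sample comparison without assuming equal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Hypothetical labels: 1 = suicidal ideation present, 0 = absent.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(f1_score(gold, pred))
```

In practice one would use `sklearn.metrics.f1_score` and `scipy.stats.ttest_ind(equal_var=False)`; the hand-rolled versions above just make the computation explicit.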
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to detect psychological crises in hotline transcripts
Evaluating performance on mood recognition and suicide risk tasks
Comparing open-source vs closed-source models for mental health applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced PsyCrisisBench, a multi-task benchmark of 540 annotated real-world hotline utterances
Evaluated 64 LLMs across 15 families under zero-shot, few-shot, and fine-tuning paradigms
Showed that fine-tuned small models and AWQ quantization enable lightweight, high-F1 crisis detection
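The zero-shot and few-shot paradigms differ only in whether labeled demonstrations precede the query. A sketch of prompt construction for suicidal ideation detection; the prompt wording and example utterances are hypothetical, not taken from PsyCrisisBench:

```python
# Hypothetical prompt builder illustrating zero-shot (shots=0) vs.
# few-shot (shots>0) evaluation; not the authors' actual prompts.

TASK = "Does the caller's utterance express suicidal ideation? Answer yes or no."

FEW_SHOT_EXAMPLES = [
    ("I just want all of this to stop; I've thought about ending it.", "yes"),
    ("Work has been stressful, but I'm managing day to day.", "no"),
]

def build_prompt(utterance, shots=0):
    """Prepend up to `shots` labeled demonstrations before the query utterance."""
    lines = [TASK, ""]
    for text, label in FEW_SHOT_EXAMPLES[:shots]:
        lines += [f"Utterance: {text}", f"Answer: {label}", ""]
    lines += [f"Utterance: {utterance}", "Answer:"]
    return "\n".join(lines)

print(build_prompt("Lately I keep thinking everyone would be better off without me.", shots=2))
```

Fine-tuning, by contrast, updates model weights on (utterance, label) pairs rather than relying on in-context demonstrations.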
Guifeng Deng
Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China; College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, 310058, China
Shuyin Rao
Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
Tianyu Lin
Johns Hopkins University
Pan Wang
Department of Psychiatry and Mental Health, Wenzhou Medical University, Wenzhou 325035, Zhejiang Province, China
Junyi Xie
Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
Haidong Song
Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
Ke Zhao
Department of Psychiatry and Mental Health, Wenzhou Medical University, Wenzhou 325035, Zhejiang Province, China
Dongwu Xu
Department of Psychiatry and Mental Health, Wenzhou Medical University, Wenzhou 325035, Zhejiang Province, China
Zhengdong Cheng
College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
Tao Li
Affiliated Mental Health Center & Hangzhou Seventh People’s Hospital, School of Brain Science and Brain Medicine, and Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China; Department of Psychiatry and Mental Health, Wenzhou Medical University, Wenzhou 325035, Zhejiang Province, China; MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou 311121, China
Haiteng Jiang
MOE Frontier Science Center for Brain Science and Brain-Machine Integration, Zhejiang University