Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

📅 2025-06-02

🤖 AI Summary

Background: Crisis detection in mental health helplines is a highly sensitive, emotion-intensive task requiring fine-grained, multi-faceted assessment. Method: We introduce PsyCrisisBench, the first multi-task, fine-grained evaluation benchmark built from real-world helpline data (540 human-annotated transcripts), covering emotion recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We systematically evaluate 64 large language models (LLMs) under standardized protocols. Results: (1) Small open-weight models (e.g., Qwen2.5-1.5B), after supervised fine-tuning, outperform larger models on the emotion and suicidal ideation tasks; (2) AWQ quantization reduces GPU memory usage by about 70% with minimal F1 degradation, and the best F1 reaches 0.907 on risk assessment; (3) Open-weight LLMs achieve overall performance comparable to closed-weight counterparts, with only emotion recognition showing a statistically significant gap (p = 0.007). This work provides empirical foundations and open resources for lightweight, trustworthy, and deployable LLMs in crisis intervention.

📝 Abstract

Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch's t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), and performance improved further with few-shot prompting and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely due to the loss of vocal cues in text and to ambiguity. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood and suicidal ideation. Open-source models such as QwQ-32B performed comparably to closed-source models on most tasks (p>0.3), though closed-source models retained an edge in mood detection (p=0.007). Performance scaled with model size only up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood recognition remains limited due to contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests that integration into hotline workflows is feasible. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.
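The two measurements named in the abstract, F1-score and Welch's t-test, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' evaluation code. The function names `f1_score` and `welch_t` are hypothetical, the binary (single positive class) form of F1 is an assumption since the abstract does not say how F1 is averaged, and all numbers in the usage lines are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for one positive class (assumed averaging scheme)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, this does not assume equal variances
    in the two groups of scores.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Illustrative labels only (1 = suicidal ideation present, 0 = absent).
gold = [1, 1, 0, 0]
pred = [1, 0, 1, 0]
print(f1_score(gold, pred))

# Illustrative per-model F1 scores for two hypothetical groups of models.
open_models = [0.61, 0.64, 0.66]
closed_models = [0.68, 0.70, 0.71]
print(welch_t(open_models, closed_models))
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns the p-value directly.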

Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to detect psychological crises in hotline transcripts

Evaluating performance on mood recognition and suicide risk tasks

Comparing open-source vs closed-source models for mental health applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 64 LLMs on PsyCrisisBench benchmark

Used zero-shot, few-shot, fine-tuning paradigms

Achieved strong F1-scores on suicidal ideation detection (0.880), suicide plan identification (0.779), and risk assessment (0.907)