🤖 AI Summary
Crisis detection in mental health helplines is a highly sensitive, emotionally demanding task that requires fine-grained, multi-faceted assessment. Method: We introduce PsyCrisisBench, the first multi-task, fine-grained evaluation benchmark built on real-world helpline data (540 human-annotated transcripts), covering emotion recognition, suicidal ideation detection, suicide plan detection, and risk assessment. We systematically evaluate 64 large language models (LLMs) under standardized protocols. Results: (1) Small open-weight models (e.g., Qwen2.5-1.5B), after supervised fine-tuning, outperform larger models on the emotion and suicidal ideation tasks; (2) AWQ quantization reduces GPU memory usage by 70% with negligible performance degradation (best F1 = 0.907); (3) Open-weight LLMs achieve overall performance comparable to closed-weight counterparts, with only emotion recognition showing a statistically significant gap (p = 0.007). This work provides empirical foundations and open resources for lightweight, trustworthy, and deployable LLMs in crisis intervention.
📝 Abstract
Psychological support hotlines are critical for crisis intervention but face significant challenges due to rising demand. Large language models (LLMs) could support crisis assessments, yet their capabilities in emotionally sensitive contexts remain unclear. We introduce PsyCrisisBench, a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 families (e.g., GPT, Claude, Gemini, Llama, Qwen, DeepSeek) using zero-shot, few-shot, and fine-tuning paradigms. Performance was measured by F1-score, with statistical comparisons via Welch's t-tests. LLMs performed strongly on suicidal ideation detection (F1=0.880), suicide plan identification (F1=0.779), and risk assessment (F1=0.907), with further gains from few-shot prompting and fine-tuning. Mood status recognition was more challenging (max F1=0.709), likely because transcripts lose vocal cues and the labels are ambiguous. A fine-tuned 1.5B-parameter model (Qwen2.5-1.5B) surpassed larger models on mood status recognition and suicidal ideation detection. Open-source models such as QwQ-32B performed comparably to closed-source models on most tasks (p>0.3), though closed-source models retained an edge in mood status recognition (p=0.007). Performance scaled with model size up to a point; quantization (AWQ) reduced GPU memory by 70% with minimal F1 degradation. LLMs show substantial promise in structured psychological crisis assessments, especially with fine-tuning. Mood status recognition remains limited due to contextual complexity. The narrowing gap between open- and closed-source models, combined with efficient quantization, suggests that practical deployment is feasible. PsyCrisisBench offers a robust evaluation framework to guide model development and ethical deployment in mental health.
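The evaluation described above relies on two standard computations, F1-score and Welch's t-test. The following is a minimal illustrative sketch of both in plain Python, not the authors' evaluation code. The function names `f1_score` and `welch_t` are hypothetical, the F1 shown is the binary (single positive class) form, and the abstract does not say which averaging scheme was used for the multi-class tasks.

```python
from math import sqrt
from statistics import mean, variance


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for one positive class: 2*TP / (2*TP + FP + FN).

    Assumption: the paper may use macro- or weighted-averaged F1 instead.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups (e.g. per-model F1 scores of
    open- and closed-source models) may have unequal variances and sizes.
    Each group needs at least two values and nonzero total variance.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The two-sided p-value follows from the t distribution with `df` degrees of freedom. The standard library has no t-distribution CDF, so in practice one would typically call `scipy.stats.ttest_ind(a, b, equal_var=False)`, which performs Welch's test directly.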