AI Summary
Existing models struggle to track risk signals that evolve over multi-turn psychological crisis conversations, and their performance degrades significantly compared with static-text settings. This work addresses that limitation with CRADLE-Dialogue, a clinician-annotated, multi-label benchmark for turn-level crisis detection comprising 600 dialogues, with annotations that distinguish past from ongoing risk. It also proposes the Alert-Confirm evaluation protocol, which separates early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm). On this benchmark, models reach only mid-40% to high-60% Micro F1. The authors additionally release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and is on par with or better than proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.
Abstract
Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.
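For readers unfamiliar with the metric, the following is a minimal sketch of how Micro F1 is conventionally computed over multi-label, turn-level predictions. It is not the benchmark's scoring script: the label strings, the `micro_f1` helper, and the choice to return 0.0 when there are no labels at all are assumptions made for illustration, and the Alert-Confirm distinction is not modeled here.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over per-turn label sets.

    Counts are pooled across all turns and all labels before computing F1,
    so frequent labels weigh more than rare ones.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0.0 when nothing is labeled or predicted.
    return 2 * tp / denom if denom else 0.0


# Hypothetical three-turn dialogue; label names are illustrative only.
gold_turns = [set(), {"suicide_ideation"}, {"suicide_ideation", "self_harm"}]
pred_turns = [{"self_harm"}, {"suicide_ideation"}, {"suicide_ideation"}]

score = micro_f1(gold_turns, pred_turns)
print(f"Micro F1: {score:.3f}")
```

In this toy dialogue there are two true positives, one false positive, and one false negative, so the pooled score is 2·2 / (2·2 + 1 + 1) = 2/3.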