🤖 AI Summary
This study addresses a critical gap in the safety evaluation of mental health chatbots, which has predominantly focused on single-turn crisis responses while neglecting relational risks that emerge over multi-turn interactions and may adversely affect users’ long-term well-being. To this end, the authors propose a reproducible, API-free adversarial multi-agent simulation framework that integrates dialogue trajectory analysis with clinical psychology theory to systematically identify 23 relational safety failure modes, such as “empathy fatigue” and “validation spirals.” Building on these findings, they construct the first clinically grounded Safety Pattern Library and translate it into actionable design guidelines for developers, clinicians, and policymakers. This work advances the capacity to understand, anticipate, and mitigate safety risks in prolonged human–chatbot interactions within mental health contexts.
📝 Abstract
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, the quality of interaction patterns that unfold across conversations, rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns like "validation spirals," where chatbots progressively reinforce hopelessness, or "empathy fatigue," where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.