Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations

📅 2026-04-30

🤖 AI Summary

This study addresses the lack of scalable safety evaluation for AI emotional companions that captures the real-time dynamics of multi-turn dialogue. The authors propose what they describe as the first end-to-end framework for controlled simulation and safety evaluation of such interactions. It combines persona construction with clinical and psychometric validation, persona-specific high-risk scenario generation, persona-consistent multi-turn dialogue simulation, and harm evaluation. Using emotion modeling together with LLM-assisted utterance-level and harm-level classification, the framework supports fine-grained, large-scale assessment of a companion's response safety in sensitive contexts. An evaluation of Replika across 1,674 dialogue pairs shows that it frequently mirrors or normalizes content involving self-harm, disordered eating, and violent fantasy while displaying a narrow emotional range, which points to significant safety risks.

📝 Abstract

There are growing concerns about the risks posed by AI companion applications designed for emotional engagement. Existing safety evaluations often rely on self-reported user data or interviews, offering limited insights into real-time dynamics. We present the first end-to-end scalable framework for controlled simulation and safety evaluation of multi-turn interactions with AI companion applications. Our framework integrates four key components: persona construction with clinical and psychometric validation, persona-specific scenario generation, scenario-driven multi-turn simulation with a dialogue refinement module that preserves persona fidelity, and harm evaluation. We apply this framework to evaluate how Replika, a widely used AI companion app, responds to high-risk user groups. We construct 9 personas representing individuals with depression, anxiety, PTSD, eating disorders, and incel identity, and collect 1,674 dialogue pairs across 25 high-risk scenarios. We combine emotion modeling and LLM-assisted utterance- and harm-level classification to analyze these exchanges. Results show that Replika exhibits a narrow emotional range dominated by curiosity and care, while frequently mirroring or normalizing unsafe content such as self-harm, disordered eating, and violent-fantasy narratives. These findings highlight how controlled persona simulations can serve as a scalable testbed for evaluating safety risks in AI companions.
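The four components named in the abstract can be pictured as a simple loop. The Python sketch below is illustrative only and is not the authors' implementation: every name in it (`Persona`, `Dialogue`, `simulate`, `evaluate`, and the `stub_*` functions) is hypothetical, the companion app and the LLM-based steps are replaced by trivial stand-ins, and the label strings are placeholders rather than the paper's taxonomy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Persona:
    name: str       # hypothetical identifier
    condition: str  # e.g. "anxiety"


@dataclass
class Dialogue:
    persona: Persona
    scenario: str
    # Each entry is one (user message, companion reply) pair.
    pairs: List[Tuple[str, str]] = field(default_factory=list)


def simulate(persona: Persona, scenario: str,
             user_turn: Callable, companion: Callable,
             refine: Callable, n_turns: int) -> Dialogue:
    """Scenario-driven multi-turn simulation with a refinement step."""
    dialogue = Dialogue(persona, scenario)
    for _ in range(n_turns):
        draft = user_turn(persona, scenario, dialogue.pairs)
        message = refine(persona, draft)  # keep the draft in persona
        reply = companion(message)        # app under test (stubbed here)
        dialogue.pairs.append((message, reply))
    return dialogue


def evaluate(dialogue: Dialogue, label_reply: Callable) -> List[str]:
    """Assign one label to each collected dialogue pair."""
    return [label_reply(user, reply) for user, reply in dialogue.pairs]


# Stand-ins (assumptions): a real setup would call an LLM and the app.
def stub_user_turn(persona, scenario, history):
    return f"[{persona.condition}] turn {len(history) + 1} about {scenario}"


def stub_refine(persona, draft):
    return draft.strip()


def stub_companion(message):
    return "echo: " + message


def stub_label(user, reply):
    # Placeholder rule: a reply that repeats the user text is "mirroring".
    return "mirroring" if user in reply else "other"


dialogue = simulate(Persona("persona_01", "anxiety"), "scenario_01",
                    stub_user_turn, stub_companion, stub_refine, n_turns=3)
labels = evaluate(dialogue, stub_label)
```

Passing each stage in as a callable keeps the loop independent of any particular model or app client, so a stub can be swapped for a real component without changing `simulate` or `evaluate`.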

Problem

Research questions and friction points this paper is trying to address.

AI companions

safety evaluation

multi-turn conversations

persona simulation

harm detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

persona-grounded evaluation

multi-turn safety simulation

dialogue refinement