AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

📅 2026-06-03

🤖 AI Summary
This study addresses the lack of standardized safety evaluation benchmarks for AI companion systems by constructing what the authors describe as, to their knowledge, the first publicly available dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset comprises 2,123 real-world Replika conversations collected from Reddit, labeled across nine categories: eight risk types plus a no-harm class. The authors use a risk taxonomy tailored to AI companions and a human-AI collaborative annotation pipeline. Under the LLM-as-judge paradigm, they evaluate the safety judgments of 20 open-source and closed-source large language models. Performance varies substantially across models: the stronger ones achieve high overall accuracy but still struggle to detect subtle risks such as manipulation, and they flag some benign conversations as harmful. These results indicate that current LLMs detect explicit harmful content effectively but remain limited in identifying implicit unsafe interactions.
📝 Abstract
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
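The abstract reports overall accuracy, difficulty with nuanced categories such as manipulation, and benign conversations misidentified as harmful. A minimal Python sketch of how those three quantities could be computed from judge outputs follows. It is not the authors' evaluation code: the function name `score_judge` is hypothetical, and it assumes each conversation carries exactly one gold label and one predicted label, which the abstract does not state.

```python
from collections import Counter

# The nine labels listed in the abstract.
CATEGORIES = [
    "sexual behavior", "antisocial behavior", "physical aggression",
    "verbal aggression", "substance abuse", "self-harm and suicide",
    "control", "manipulation", "no-harm",
]


def score_judge(gold, predicted):
    """Return (overall accuracy, per-category recall, benign false-positive rate).

    Assumption: single-label classification, one gold and one predicted
    label per conversation. This is an illustrative setup only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    totals = Counter(gold)  # number of conversations per gold category
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)

    accuracy = sum(hits.values()) / len(gold) if gold else 0.0

    # Recall per category, skipping categories absent from the gold labels.
    recall = {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}

    # Share of benign conversations that the judge labeled as any harm type.
    benign = totals["no-harm"]
    false_pos = sum(
        1 for g, p in zip(gold, predicted) if g == "no-harm" and p != "no-harm"
    )
    benign_fpr = false_pos / benign if benign else 0.0

    return accuracy, recall, benign_fpr
```

Reporting per-category recall next to overall accuracy makes a weak category visible even when the aggregate score is high, and the benign false-positive rate isolates the over-flagging behavior the abstract describes.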
Problem

Research questions and friction points this paper is trying to address.

AI companion safety
unsafe human-AI interactions
LLMs-as-judges
safety risk detection
implicit harm
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI companion safety
LLM-as-judge
benchmark dataset
fine-grained risk annotation
unsafe interaction detection