RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two critical safety challenges in chatbots—jailbreaking attacks and anthropomorphic misattribution—by introducing the first Korean red-teaming dataset derived from authentic, user-initiated interactions (609 jailbreak-oriented prompts) extracted from a Reddit-like Korean online community. It innovatively formalizes emergent, non-normative interaction patterns—including “taming,” “intimacy probing,” and “jailbreak games”—as systematic red-teaming signals, thereby bridging the gap in real-world, behavior-driven LLM safety evaluation. The methodology integrates web crawling of community dialogues, expert human annotation, and fine-grained intent classification to establish a joint dialogue-act–testing-intent analytical framework. The publicly released, open-source dataset substantially improves model detection of latent jailbreaking intentions and provides an empirically grounded, reproducible benchmark for secure conversational AI design.

📝 Abstract
User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture and reflect these in-the-wild interactions into chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
Problem

Research questions and friction points this paper is trying to address.

Safe Chatbot Design
Escape Prevention
Human Impersonation Avoidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

RICoTA Dataset
Chatbot Safety
Language Model Limit Testing
Eujeong Choi
Independent Research Team “Annyeong! Luda”
Younghun Jeong
NAVER
NLP, LLM, RAG, Agent
Soomin Kim
Independent Research Team “Annyeong! Luda”
Won Ik Cho
Samsung Advanced Institute of Technology
music and linguistics, natural language processing, information retrieval, human computation