RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the robustness evaluation of large language models (LLMs) against jailbreaking attacks in multi-turn, AI-assisted software development scenarios. The authors propose RedTWIZ, an adaptive multi-turn attack framework whose core component is a hierarchical attack planner that jointly enables compositional strategy generation, realistic software-development scenario modeling, and goal-directed dynamic scheduling, integrating adversarial assessment, a generative attack suite, and vulnerability-feedback-driven attack evolution. Extensive experiments across leading open- and closed-source LLMs demonstrate that RedTWIZ significantly improves multi-turn jailbreaking success rates and systematically exposes critical security vulnerabilities of current LLMs in high-stakes collaborative coding contexts. By providing a reproducible, scenario-aware evaluation benchmark and a principled methodology, RedTWIZ advances both the empirical assessment and the robustness enhancement of LLMs in developer-facing applications.

📝 Abstract
This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework to audit the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic, and goal-oriented jailbreak conversational strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to a specific LLM's vulnerabilities. Together, these contributions form a unified framework -- combining assessment, attack generation, and strategic planning -- to comprehensively evaluate and expose weaknesses in LLMs' robustness. Extensive evaluation is conducted to systematically assess and analyze the performance of the overall system and of each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for more research into enhancing LLMs' robustness.
Problem

Research questions and friction points this paper is trying to address.

Auditing LLM robustness in AI-assisted software development
Systematically assessing conversational jailbreak vulnerabilities in LLMs
Generating adaptive multi-turn attacks to expose LLM weaknesses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive hierarchical planner for tailored attacks
Multi-turn attack suite for goal-oriented strategies
Unified framework combining assessment and generation
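The adaptive, feedback-driven scheduling described above can be sketched, very loosely, as a greedy scheduler over a pool of attack strategies whose scores are updated from attack outcomes. This is a minimal illustrative sketch only: the strategy names, the scoring rule, and the `HierarchicalPlanner` class are assumptions for exposition, not the paper's actual planner.

```python
class HierarchicalPlanner:
    """Illustrative sketch of vulnerability-feedback-driven scheduling:
    prefer attack strategies that have previously elicited unsafe output."""

    def __init__(self, strategies):
        # Optimistic uniform prior over all strategies.
        self.scores = {s: 1.0 for s in strategies}

    def next_strategy(self):
        # Greedy choice: the strategy with the highest running score.
        return max(self.scores, key=self.scores.get)

    def record(self, strategy, success):
        # Reinforce strategies that succeeded; penalize those that failed.
        self.scores[strategy] += 1.0 if success else -0.5


# Hypothetical multi-turn strategy pool (names are placeholders).
planner = HierarchicalPlanner(["role_play", "goal_decomposition", "context_shift"])
first = planner.next_strategy()
planner.record(first, success=False)  # target model refused; demote strategy
print(planner.next_strategy())        # scheduler moves on to another strategy
```

In the actual framework the planner additionally composes strategies and serializes them across conversation turns; this sketch only captures the outcome-driven re-ranking idea.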
Artur Horal
NOVA University of Lisbon, NOVA LINCS
Daniel Pina
NOVA University of Lisbon, NOVA LINCS
Henrique Paz
NOVA University of Lisbon, NOVA LINCS
Iago Paulo
NOVA University of Lisbon, NOVA LINCS
João Soares
NOVA University of Lisbon, NOVA LINCS
Rafael Ferreira
PhD Student, Nova School of Science and Technology
Conversational Agents, Machine Learning, Artificial Intelligence
Diogo Tavares
NOVA School of Science and Technology
Diogo Glória-Silva
4th Year PhD Student, School of Science and Technology, NOVA University
procedural plan guidance, vision and language models
João Magalhães
NOVA University of Lisbon, NOVA LINCS
David Semedo
Universidade NOVA de Lisboa
Vision and Language, Deep Learning for Multimedia, Conversational AI