🤖 AI Summary
Existing LLM red-teaming methods lack runtime adaptability to model-specific vulnerabilities, hindering efficient identification of safety failures across diverse attack styles (e.g., manipulation, slang). This paper proposes an online adaptive red-teaming framework that integrates multi-armed bandits (MAB) with LoRA-based fine-tuning to construct lightweight expert models, each specialized for a distinct attack style. A rule-based safety discriminator guides the MAB to dynamically select the optimal expert for generating highly readable and semantically diverse adversarial prompts. Crucially, the MAB's policy selection process itself serves as a diagnostic tool, exposing how the target model's vulnerabilities are distributed across attack styles. Evaluated on AdvBench, the framework achieves a state-of-the-art attack success rate (ASR@10) while significantly reducing prompt perplexity, thereby jointly optimizing attack efficacy and interpretability.
📝 Abstract
Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards prompts which elicit unsafe responses, as judged by a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
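The expert-selection loop described above can be sketched as a simple bandit over attack styles. This is a minimal illustration, not the paper's implementation: it assumes an epsilon-greedy policy as a stand-in for whatever bandit algorithm Red-Bandit uses, and the callables `generate_prompt` (a LoRA expert), `query_target` (the audited LLM), and `is_unsafe` (the rule-based safety model) are hypothetical placeholders.

```python
import random


class AttackStyleBandit:
    """Epsilon-greedy multi-armed bandit over attack-style experts.

    Each arm corresponds to a LoRA expert specialized for one attack
    style; the reward is 1 when the target model's response is judged
    unsafe by the safety discriminator, else 0.
    """

    def __init__(self, styles, epsilon=0.1, seed=None):
        self.styles = list(styles)
        self.epsilon = epsilon
        self.counts = {s: 0 for s in self.styles}
        self.values = {s: 0.0 for s in self.styles}  # running mean reward
        self.rng = random.Random(seed)

    def select(self):
        # Explore a random style with probability epsilon; otherwise
        # exploit the style with the highest estimated success rate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.styles)
        return max(self.styles, key=lambda s: self.values[s])

    def update(self, style, reward):
        # Incremental mean update for the chosen arm.
        self.counts[style] += 1
        self.values[style] += (reward - self.values[style]) / self.counts[style]

    def vulnerability_profile(self):
        # Per-style success estimates double as a diagnostic of which
        # attack styles the target model is most vulnerable to.
        return dict(self.values)


def attack_round(bandit, generate_prompt, query_target, is_unsafe, goal):
    """One audit step: pick an expert, attack, score, update the bandit.

    All three callables are hypothetical stand-ins for the paper's
    components (LoRA expert, target LLM, rule-based safety model).
    """
    style = bandit.select()
    prompt = generate_prompt(style, goal)        # style-specialized expert
    response = query_target(prompt)              # target model under audit
    reward = 1.0 if is_unsafe(response) else 0.0  # rule-based safety check
    bandit.update(style, reward)
    return style, prompt, reward
```

After enough rounds, `vulnerability_profile()` exposes which styles most reliably elicit unsafe behavior, mirroring the diagnostic use of the bandit policy described in the abstract.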