DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of constructing robust multilingual LLM safety classifiers—“guardrails”—given the scarcity of high-quality non-English safety data. The authors propose a two-player reinforcement learning framework in which a generator and a guardrail model co-evolve adversarially to synthesize high-quality multilingual safety data end-to-end. The method integrates PPO-based training and multilingual instruction tuning in a lightweight architecture (0.5B parameters), and formally models the generator-guardrail interaction as a two-player game with proven convergence to a Nash equilibrium. Experiments demonstrate that the model outperforms LlamaGuard-3 (8B) by nearly 10% on English benchmarks while achieving 4.5x faster inference. Crucially, it delivers substantial gains in lower-resource languages, including Arabic and Spanish, where prior methods suffer from severe data scarcity. To foster reproducibility, the code, models, and synthetically generated multilingual safety dataset are fully open-sourced.
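The co-evolution described above can be sketched as a toy loop. This is an illustrative simplification, not the paper's method: the paper trains an LLM generator against a 0.5B guard model with PPO, whereas here each synthetic prompt is reduced to a single hypothetical "hardness" score, the guard is a scalar threshold, and the update rules are invented for illustration.

```python
import random

def coevolve(rounds=3000, lr=0.02, seed=0):
    """Toy sketch of generator/guard co-evolution (illustrative only).

    Each synthetic prompt is reduced to a scalar 'hardness' in [0, 1];
    prompts with hardness > 0.5 are truly unsafe. The guard flags prompts
    above a learned threshold, while the generator shifts its samples
    toward that threshold, i.e. toward the region where the guard errs.
    """
    rng = random.Random(seed)
    gen_mean, guard_thresh = 0.9, 0.2
    for _ in range(rounds):
        hardness = min(1.0, max(0.0, rng.gauss(gen_mean, 0.15)))
        unsafe = hardness > 0.5
        flagged = hardness > guard_thresh
        if flagged and not unsafe:      # false positive: threshold too low
            guard_thresh += lr
        elif unsafe and not flagged:    # false negative: threshold too high
            guard_thresh -= lr
        # Adversarial generator: concentrate samples near the guard's
        # current decision boundary, where mistakes are most likely.
        gen_mean += lr * (guard_thresh - gen_mean)
    return guard_thresh
```

In this toy setting the guard's threshold settles near the true safety boundary (0.5) precisely because the generator keeps probing wherever the guard is weakest, which is the intuition behind using adversarial generation to cover gaps in scarce multilingual safety data.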

📝 Abstract
The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
Problem

Research questions and friction points this paper is trying to address.

Addressing multilingual safety data scarcity
Improving LLM guardrail model efficiency
Enhancing guardrail performance for low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-player Reinforcement Learning framework
Synthetic data generation for multilingual training
Nash equilibrium in adversarial co-evolution
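The Nash-equilibrium claim can be illustrated on a toy zero-sum game. The sketch below is not the paper's generator/guardrail game: it runs multiplicative-weights self-play on matching pennies, a standard stand-in, where a classical result for two-player zero-sum games says the time-averaged strategies of no-regret learners approach a Nash equilibrium (here, both players mixing heads/tails 50/50).

```python
import math

def mwu_selfplay(rounds=10000):
    """No-regret self-play on matching pennies (toy zero-sum game).

    Row wins (+1) when the two coins match; column wins on a mismatch.
    With multiplicative-weights updates and a decreasing step size, the
    time-averaged strategies converge to the unique mixed Nash
    equilibrium, in which both players play heads with probability 1/2.
    """
    row_w, col_w = [2.0, 1.0], [1.0, 2.0]  # asymmetric start: (heads, tails)
    row_avg = col_avg = 0.0                # running average of P(heads)
    for t in range(rounds):
        eta = 0.5 / math.sqrt(t + 1)       # decreasing step size
        rp = row_w[0] / sum(row_w)         # row's current P(heads)
        cp = col_w[0] / sum(col_w)         # column's current P(heads)
        row_avg += rp / rounds
        col_avg += cp / rounds
        # Expected payoff of each pure action against the opponent's mix.
        row_pay = [2 * cp - 1, 1 - 2 * cp]  # row prefers to match
        col_pay = [1 - 2 * rp, 2 * rp - 1]  # column prefers to mismatch
        for a in (0, 1):
            row_w[a] *= math.exp(eta * row_pay[a])
            col_w[a] *= math.exp(eta * col_pay[a])
        # Renormalize to keep the weights numerically bounded.
        s = sum(row_w); row_w = [w / s for w in row_w]
        s = sum(col_w); col_w = [w / s for w in col_w]
    return row_avg, col_avg
```

The instantaneous strategies cycle around the equilibrium rather than settling, which is why the averages are the quantity that converges; the paper's contribution is proving an analogous convergence guarantee for its generator/guardrail game.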
Yihe Deng
University of California, Los Angeles
Machine Learning; Natural Language Processing
Yu Yang
University of California, Los Angeles; VirtueAI
Junkai Zhang
University of California, Los Angeles
Wei Wang
University of California, Los Angeles
Bo Li
VirtueAI; University of Illinois at Urbana-Champaign