🤖 AI Summary
This work addresses the challenge of building robust multilingual LLM safety classifiers ("guardrails") given the scarcity of high-quality non-English safety data. The authors propose a two-player reinforcement learning framework in which a generator and a guardrail model co-evolve adversarially to synthesize high-quality multilingual safety data end-to-end; the interaction is formalized as a two-player game and proven to converge to a Nash equilibrium. The method integrates PPO-based training, multilingual instruction tuning, and a lightweight 0.5B-parameter architecture. Experiments show the model outperforms LlamaGuard3 (8B) by nearly 10% on English benchmarks while achieving 4.5x faster inference. Crucially, it delivers substantial gains in lower-resource languages, including Arabic and Spanish, where prior methods suffer from severe data scarcity. To foster reproducibility, the authors fully open-source their code, models, and synthetically generated multilingual safety dataset.
📝 Abstract
The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages on a collected real-world dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
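To give intuition for the two-player zero-sum formalization and its Nash equilibrium, the toy sketch below computes the equilibrium of a small zero-sum game via fictitious play, a classical iterative scheme where each player best-responds to the opponent's empirical strategy. This is purely illustrative: the payoff matrix, the fictitious-play updates, and all names here are assumptions for the sketch, not the paper's PPO-based generator/guardrail training procedure.

```python
# Toy zero-sum game: a "generator" (row player) tries to make the "guard"
# (column player) err, and the guard tries to classify correctly. Payoffs
# are to the row player; this matching-pennies structure is an illustrative
# assumption, not the game analyzed in the paper.
A = [[1, -1],
     [-1, 1]]

def fictitious_play(A, rounds=20000):
    """Each round, both players best-respond to the opponent's
    empirical mixture of past plays; for zero-sum games the empirical
    frequencies converge to a Nash equilibrium (Robinson, 1951)."""
    n, m = len(A), len(A[0])
    row_counts = [1] + [0] * (n - 1)  # arbitrary initial play
    col_counts = [1] + [0] * (m - 1)
    for _ in range(rounds):
        # Row player's expected payoff per action vs. the column mixture.
        row_payoffs = [sum(A[i][j] * col_counts[j] for j in range(m))
                       for i in range(n)]
        # Column player minimizes the row payoff (zero-sum).
        col_payoffs = [-sum(A[i][j] * row_counts[i] for i in range(n))
                       for j in range(m)]
        row_counts[row_payoffs.index(max(row_payoffs))] += 1
        col_counts[col_payoffs.index(max(col_payoffs))] += 1
    total_r, total_c = sum(row_counts), sum(col_counts)
    return ([c / total_r for c in row_counts],
            [c / total_c for c in col_counts])

row_mix, col_mix = fictitious_play(A)
# For this symmetric game the unique equilibrium mixes both actions
# equally, so both empirical strategies approach (0.5, 0.5).
```

The same fixed-point idea underlies the paper's convergence claim: at equilibrium, neither the data generator nor the guardrail model can unilaterally improve, which is what makes the adversarially generated data a stable training signal.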