RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-ended generation tasks face a scaling challenge: outputs must satisfy many implicit, multi-dimensional evaluation criteria, so exhaustive verification is prohibitively costly and still incomplete, while handcrafted unified reward functions rely heavily on prompt engineering. To address this, the authors propose RLAC (Reinforcement Learning with Adversarial Critic), a post-training framework in which a large language model serves as a dynamic critic that adaptively identifies the most likely failure modes in generated responses and is optimized jointly with the generator. By routing only these candidate failures to an external validator inside the training loop, RLAC sharply reduces verification cost while improving reward quality. Experiments on factual text generation and code generation show that RLAC outperforms both exhaustive-verification baselines and conventional reward models, supporting its effectiveness, generalizability, and scalability.

📝 Abstract
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that the best way to combine these rubrics into a single reward is often highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or an unhandled edge case), which are then checked by an external validator; generator and critic are optimized jointly against each other. This adversarial game enhances the critic's error detection and the generator's output quality while reducing the number of verifications required. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
Problem

Research questions and friction points this paper is trying to address.

Open-ended generation tasks require satisfying diverse implicit evaluation rubrics
High verification costs and incomplete assessments hinder reinforcement learning scaling
Combining multiple rubrics into single rewards is highly prompt-specific and challenging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic rubric verification reduces required verifications
LLM critic identifies most likely failure modes
Jointly trains generator and critic for quality enhancement
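The adversarial loop described above can be sketched as a single zero-sum round: the critic names the one failure mode it considers most likely, an external validator checks only that claim, and the outcome rewards one player at the other's expense. The sketch below is a toy illustration under assumed names (`rlac_step` and the stub generator/critic/validator are not from the paper's implementation):

```python
# Minimal sketch of one RLAC-style adversarial round.
# All names here (rlac_step, the toy generator/critic/validator)
# are illustrative assumptions, not the paper's actual code.

def rlac_step(prompt, generator, critic, validator):
    """Critic proposes the single most likely failure mode of the
    generator's response; an external validator checks only that one
    claim, replacing exhaustive rubric verification with one check."""
    response = generator(prompt)
    claimed_failure = critic(prompt, response)        # e.g. "wrong capital"
    confirmed = validator(response, claimed_failure)  # cheap single check
    generator_reward = 0.0 if confirmed else 1.0      # generator wins if claim fails
    critic_reward = 1.0 - generator_reward            # critic wins if claim holds
    return generator_reward, critic_reward

# Toy instantiation: the "validator" just tests substring presence.
gen = lambda p: "Paris is the capital of France."
crit = lambda p, r: "France"            # critic bets this token is missing/wrong
val = lambda r, claim: claim not in r   # failure confirmed if token is absent
g_r, c_r = rlac_step("Capital of France?", gen, crit, val)  # → (1.0, 0.0)
```

In training, these zero-sum rewards would drive gradient updates for both players, so the critic learns sharper attacks while the generator learns to preempt them.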