🤖 AI Summary
This work addresses two challenges in multi-domain reinforcement learning: unreliable rewards on non-verifiable tasks and capability interference across domains. It proposes CARE-RL, which combines a Protocol-Aware Generative Reward Model (PA-GRM) with Direction-Aware Capability Subspace Projection (DACSP). PA-GRM constructs prompt-level evaluation protocols and then produces trace-conditioned rewards, enabling task-adaptive yet comparable assessment of open-ended responses. DACSP extracts historical capability directions from earlier RL stages and uses them to modulate later updates: components aligned with those directions are amplified, conflicting components are suppressed, and orthogonal components are preserved, which mitigates optimization conflicts across domains. On Qwen2.5-7B and Qwen3-4B, the approach achieves Total Avg scores of 47.9 and 50.7, respectively, consistently outperforming standard multi-domain RL baselines.
📝 Abstract
Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL, which combines protocol-aware reward generation with capability-aware optimization to mitigate cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
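The DACSP idea of amplifying aligned components, suppressing conflicting ones, and preserving orthogonal ones can be illustrated with a minimal sketch. The abstract does not specify how capability directions are extracted or how strongly each component is scaled, so everything here is an assumption made for illustration: the function name `modulate_update`, the `amplify` and `suppress` factors, the requirement that the directions be orthonormal, and the convention that a positive projection means "aligned" and a negative one means "conflicting".

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def modulate_update(update, directions, amplify=1.5, suppress=0.25):
    """Rescale the parts of `update` that lie along known directions.

    Hypothetical sketch, not the paper's implementation.
    `directions` is assumed to be an orthonormal list of vectors that
    stand in for historical capability directions.
    """
    result = list(update)
    for d in directions:
        coeff = dot(update, d)  # signed projection onto this direction
        # Assumed convention: positive projection = aligned, negative = conflicting.
        scale = amplify if coeff > 0 else suppress
        # Swap the component coeff * d for scale * coeff * d.
        delta = (scale - 1.0) * coeff
        result = [r + delta * di for r, di in zip(result, d)]
    # Whatever lies outside the span of `directions` is left untouched.
    return result


# Toy example: two capability directions in a 3-dimensional parameter space.
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = [2.0, -4.0, 3.0]
print(modulate_update(update, directions))
```

In this toy case the first coordinate is aligned and grows, the second is conflicting and shrinks toward zero, and the third is orthogonal to both directions and passes through unchanged. A real system would operate on model parameter updates and would need its own procedure for estimating the directions from earlier RL stages.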