CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

📅 2026-05-30

🤖 AI Summary
This work addresses two obstacles in multi-domain reinforcement learning: unreliable rewards on non-verifiable tasks and capability interference across domains. It proposes a Protocol-Aware Generative Reward Model (PA-GRM) and Direction-Aware Capability Subspace Projection (DACSP). PA-GRM constructs prompt-level evaluation protocols and schemas and then produces trace-conditioned rewards, enabling task-adaptive yet comparable assessment of open-ended responses. DACSP extracts historical capability directions from previous RL stages and modulates later updates within the capability subspace by amplifying aligned components, suppressing conflicting components, and preserving orthogonal ones, thereby mitigating optimization conflicts across domains. Evaluated on Qwen2.5-7B and Qwen3-4B, the approach achieves Total Avg scores of 47.9 and 50.7, respectively, consistently outperforming standard multi-domain RL baselines.
📝 Abstract
Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL, which combines protocol-aware reward generation with capability-aware optimization to mitigate cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
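The abstract describes DACSP only at a high level, so the following is a minimal sketch of the general idea rather than the authors' method. It assumes the historical capability directions are given as orthonormal vectors, that a positive projection coefficient means "aligned" and a negative one means "conflicting", and that modulation is a simple per-direction scaling. The function name `modulate_update` and the parameters `amplify` and `suppress` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def modulate_update(
    update: Sequence[float],
    directions: Sequence[Sequence[float]],
    amplify: float = 1.5,   # hypothetical gain for aligned components
    suppress: float = 0.2,  # hypothetical gain for conflicting components
) -> Vector:
    """Rescale the part of `update` that lies in the span of `directions`.

    ASSUMPTION: `directions` are orthonormal, so each coefficient can be
    computed and rescaled independently. The paper's actual rule may differ.
    """
    residual = list(update)            # becomes the orthogonal component
    modulated = [0.0] * len(update)    # rescaled in-subspace component
    for d in directions:
        coeff = dot(update, d)         # signed projection onto this direction
        residual = [r - coeff * di for r, di in zip(residual, d)]
        # Positive projection: treat as aligned and amplify.
        # Negative projection: treat as conflicting and suppress.
        scale = amplify if coeff > 0 else suppress
        modulated = [m + scale * coeff * di for m, di in zip(modulated, d)]
    # The orthogonal residual is passed through unchanged.
    return [r + m for r, m in zip(residual, modulated)]


if __name__ == "__main__":
    history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy capability directions
    new_update = [2.0, -1.0, 3.0]
    print(modulate_update(new_update, history))
```

In this toy case the first coordinate agrees with a historical direction and is amplified, the second opposes one and is shrunk, and the third lies outside the subspace and is left untouched. In practice the directions would need to be orthonormalized first (for example via Gram-Schmidt or an SVD) for the per-direction scaling to be well defined.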
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
multi-domain
reward unreliability
capability interference
cross-domain conflicts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Protocol-Aware Reward
Capability-Aware Optimization
Cross-Domain Conflict Mitigation
Direction-Aware Subspace Projection
Multi-Domain Reinforcement Learning