Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration

📅 2026-01-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing hybrid supervised fine-tuning (SFT) and reinforcement learning (RL) approaches: they lack a mechanism for dynamically allocating training stages according to the intrinsic learning demands of the data, which often leads to optimization interference. Drawing on schema theory, the authors propose PRISM, a framework that, for the first time, links cognitive conflict with gradient-space concentration. By analyzing the geometric structure of gradients, PRISM identifies high-conflict samples (those requiring knowledge restructuring) and routes them to RL, while low-conflict samples are efficiently consolidated via SFT. This dynamic data arbitration mechanism moves beyond conventional surface-level heuristics, achieving Pareto improvements over prior methods on the WebShop and ALFWorld benchmarks while reducing computational cost by up to 3.22×.

📝 Abstract
While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22×. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
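
The routing idea described in the abstract can be made concrete with a small sketch. The snippet below is a hypothetical illustration, not the paper's implementation: it measures how concentrated a sample's gradient is across parameter groups and routes concentrated (high-conflict) samples to RL and diffuse (low-conflict) samples to SFT. The concentration statistic, the `gradient_concentration`/`route_sample` names, and the 0.5 threshold are all assumptions; PRISM's exact gradient-geometry criterion is defined in the paper.

```python
# Minimal sketch of PRISM-style data arbitration (not the authors' code).
# Assumption: concentration is approximated as the fraction of squared-gradient
# "energy" carried by the single largest parameter group; the paper's statistic
# may differ.
import torch


def gradient_concentration(grads):
    """Higher values mean the update is concentrated in a few parameter
    groups (a proxy for a high-conflict sample); lower values mean a
    diffuse update (a proxy for a low-conflict sample)."""
    energies = torch.stack([g.pow(2).sum() for g in grads])
    return (energies.max() / energies.sum()).item()


def route_sample(model, loss_fn, batch, threshold=0.5):
    """Compute the gradient for one sample and return the suggested stage:
    'RL' for concentrated (high-conflict) samples, 'SFT' for diffuse ones."""
    model.zero_grad()
    loss = loss_fn(model, batch)  # e.g., token-level cross-entropy on the sample
    loss.backward()
    grads = [p.grad.detach() for p in model.parameters() if p.grad is not None]
    return "RL" if gradient_concentration(grads) > threshold else "SFT"
```

In practice such a score would be computed per sample (or per small batch) in a diagnostic pass, and the two resulting subsets would then be sent to the RL and SFT stages respectively.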
Problem

Research questions and friction points this paper is trying to address.

Supervised Fine-Tuning
Reinforcement Learning
Data Allocation
Optimization Interference
LLM Agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

PRISM
gradient concentration
cognitive conflict
hybrid SFT-RL
schema theory
Yang Zhao
Research Professor, Zhejiang University, China
Intelligent Building, Smart Grid, Fault Detection and Diagnosis, Energy Efficiency
Yangou Ouyang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Xiao Ding
Harbin Institute of Technology
Natural Language Processing, Artificial Intelligence
Hepeng Wang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Bibo Cai
Harbin Institute of Technology
NLP
Kai Xiong
Harbin Institute of Technology
Event-Centric Reasoning, Large Language Models, Event Graph
Jin-Fang Gao
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Zhouhao Sun
Harbin Institute of Technology
NLP
Li Du
BAAI
LLM, NLP, Data Science, Interpretable AI
Bing Qin
Professor at Harbin Institute of Technology
Natural Language Processing, Information Extraction, Sentiment Analysis
Ting Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China