SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

📅 2026-06-01
🤖 AI Summary
Existing long-horizon LLM agents rely on external skill generators during training or on skill libraries retrieved at runtime, which adds engineering complexity, longer contexts, and deployment latency. This work proposes SIRI, a three-phase framework in which skill learning is fully endogenous: the policy is first warmed up with GiGPO to acquire basic interaction ability, then extracts and validates skills from its own successful trajectories, and finally distills the beneficial skills back into the original skill-free policy. With no external skill generator or inference-time skill bank, SIRI raises Qwen2.5-7B-Instruct over GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines while reducing deployment overhead.
📝 Abstract
Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy achieves performance comparable to distillation with a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.
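The validation and distillation steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Rollout` layout, the use of a success-rate gap as the trajectory-level utility, and the zero thresholds are all assumptions made for illustration, since the abstract only states that skills are validated through paired rollouts and that tokens are selected using trajectory-level utility and action-level advantage.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Rollout:
    """One episode (hypothetical layout): success flag plus per-action-token advantages."""
    success: float
    token_advantages: List[float]


def skill_utility(with_skill: Sequence[Rollout], without_skill: Sequence[Rollout]) -> float:
    """Assumed trajectory-level utility: success-rate gap between paired rollouts."""
    return mean(r.success for r in with_skill) - mean(r.success for r in without_skill)


def distillation_mask(rollout: Rollout, utility: float,
                      min_utility: float = 0.0, min_advantage: float = 0.0) -> List[bool]:
    """Mark which skill-guided action tokens would be distilled into the plain policy.

    Assumed gating rule: drop the whole trajectory if the skill did not help,
    otherwise keep only tokens whose advantage exceeds the threshold.
    """
    if utility <= min_utility:
        return [False] * len(rollout.token_advantages)
    return [a > min_advantage for a in rollout.token_advantages]


# Paired rollouts on the same tasks: skill-augmented prompt vs. plain prompt.
guided = [Rollout(1.0, [0.5, -0.2, 0.1]), Rollout(1.0, [0.3, 0.4, -0.1])]
plain = [Rollout(1.0, [0.0, 0.0, 0.0]), Rollout(0.0, [0.0, 0.0, 0.0])]

utility = skill_utility(guided, plain)           # 1.0 - 0.5 = 0.5
mask = distillation_mask(guided[0], utility)     # [True, False, True]
rejected = distillation_mask(guided[0], -0.25)   # harmful skill: nothing is distilled
```

Under these assumptions, a skill that does not raise the success rate contributes no training signal, and even a helpful skill contributes only the tokens with positive advantage, which matches the abstract's statement that only beneficial skill-guided action tokens are distilled.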
Problem

Research questions and friction points this paper is trying to address.

long-horizon LLM agents
external skill generators
skill retrieval
deployment latency
engineering complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-skill mining
skill internalization
reinforcement learning
LLM agents
trajectory distillation