Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of end-to-end management of interaction data in existing agentic reinforcement learning (RL) systems, which leaves trajectories poorly organized and underused. The authors propose a step-level data middleware architecture that captures multi-turn interactions through a unified LLM API and structures trajectories into managed data assets, stored as step-level records of prompt IDs, response IDs, rewards, and other metadata. The system is built around two core components, a Gateway Server and a Data Pool, which connect heterogeneous agent runtimes to RL training backends. It also supports interactive trajectory inspection, curation by quality and readiness, and configuration of training-ready batches for different downstream RL algorithms. The framework aims to bridge the gap between data production and training consumption in agentic RL.
📝 Abstract
Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.
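To make the idea of a step-level record more concrete, here is a minimal Python sketch of what such a record and a pool that filters by readiness and quality might look like. It is an illustration under stated assumptions, not the Claw-R1 implementation: the names StepRecord, DataPool, add_step, ready_steps, and build_batch, the trajectory_id and step_index fields, and the min_reward threshold are all hypothetical. Only the record contents (prompt IDs, response IDs, rewards, and other metadata) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepRecord:
    """One agent-environment interaction step (hypothetical schema)."""
    trajectory_id: str          # assumed grouping key, not from the paper
    step_index: int             # assumed position of the step in its trajectory
    prompt_ids: list[int]       # token IDs sent to the LLM
    response_ids: list[int]     # token IDs generated by the LLM
    reward: Optional[float] = None   # None until a reward has been assigned
    metadata: dict = field(default_factory=dict)


class DataPool:
    """In-memory stand-in for a step-level data pool (illustrative only)."""

    def __init__(self) -> None:
        self._steps: list[StepRecord] = []

    def add_step(self, step: StepRecord) -> None:
        self._steps.append(step)

    def ready_steps(self, min_reward: float = float("-inf")) -> list[StepRecord]:
        """Steps that already have a reward and meet the quality threshold."""
        return [
            s for s in self._steps
            if s.reward is not None and s.reward >= min_reward
        ]

    def build_batch(self, batch_size: int,
                    min_reward: float = float("-inf")) -> list[StepRecord]:
        """Return up to batch_size ready steps, in insertion order."""
        return self.ready_steps(min_reward)[:batch_size]


pool = DataPool()
pool.add_step(StepRecord("traj-0", 0, [1, 2, 3], [4, 5], reward=0.2))
pool.add_step(StepRecord("traj-0", 1, [1, 2, 3, 4, 5, 6], [7], reward=1.0))
pool.add_step(StepRecord("traj-1", 0, [8, 9], [10]))  # reward not assigned yet

batch = pool.build_batch(batch_size=8, min_reward=0.5)
```

In this sketch, a step without a reward is treated as not yet ready for training, and the reward threshold stands in for whatever quality criterion a user would choose when curating data.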
Problem

Research questions and friction points this paper is trying to address.

agentic reinforcement learning
data lifecycle
step-level data
data middleware
agent-environment interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level data middleware
agentic reinforcement learning
data lifecycle management
interactive data curation
LLM-based agents
Daoyu Wang
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Mingyue Cheng
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Qingchuan Li
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Shuo Yu
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Jie Ouyang
State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China
Qi Liu
University of Science and Technology of China
Data Mining, Educational Big Data, Recommender Systems, Social Network Analysis