🤖 AI Summary
High-quality robot trajectory data are scarce and human teleoperation is costly, which limits the performance of vision-language-action (VLA) models. To address this, the work proposes RDGen, a framework that repurposes sim-to-real reinforcement learning (RL) policies as structured trajectory generators rather than as final control policies. By combining task parsing from a vision-language model (VLM) with object localization via Grounding DINO, RDGen produces smooth demonstration trajectories on real robots with a high task success rate. Experiments on a pick-and-place task show that the generated data improve downstream VLA performance compared with human teleoperation data, suggesting a scalable approach to collecting demonstrations for robotic imitation learning.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning (RL) framework for generating high-quality robot demonstrations. Rather than employing RL solely as the final control policy, RDGen uses trained RL policies as structured trajectory generators. The system consists of a vision-language model (VLM)-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.
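The pipeline described above (parse the task, localize the object, roll out the policy, keep only the successes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`harvest_demonstrations`, `Demonstration`, and the `stub_*` helpers) is hypothetical, and the stubs merely stand in for the VLM parser, the Grounding DINO localizer, and the transferred RL policy, none of which are actually called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Demonstration:
    """One successful rollout kept as training data (hypothetical record layout)."""
    instruction: str
    target_name: str
    waypoints: List[Point]


def harvest_demonstrations(
    instruction: str,
    parse_task: Callable[[str], str],
    localize: Callable[[str], Optional[Point]],
    run_policy: Callable[[Point], Tuple[List[Point], bool]],
    n_attempts: int,
) -> List[Demonstration]:
    """Run the policy n_attempts times and keep only the successful rollouts."""
    target_name = parse_task(instruction)      # stand-in for the VLM task parser
    demos: List[Demonstration] = []
    for _ in range(n_attempts):
        target = localize(target_name)         # stand-in for the object localizer
        if target is None:
            continue                           # object not found: skip this attempt
        waypoints, success = run_policy(target)
        if success:                            # failed rollouts are discarded
            demos.append(Demonstration(instruction, target_name, waypoints))
    return demos


# --- Stubs below are assumptions made purely for illustration ---

def stub_parse_task(instruction: str) -> str:
    # Pretend parser: treat the last two words as the object name.
    return " ".join(instruction.split()[-2:])


def stub_localize(name: str) -> Optional[Point]:
    # Pretend localizer: only knows one object at a fixed position.
    return (0.4, 0.1, 0.02) if name == "red block" else None


def make_stub_policy(fail_every: int = 3):
    # Pretend policy: straight-line motion from the origin to the target,
    # reporting a failure on every `fail_every`-th call.
    calls = {"n": 0}

    def run_policy(target: Point) -> Tuple[List[Point], bool]:
        calls["n"] += 1
        steps = 10
        waypoints = [tuple(c * k / steps for c in target) for k in range(steps + 1)]
        return waypoints, calls["n"] % fail_every != 0

    return run_policy


demos = harvest_demonstrations(
    "pick up the red block", stub_parse_task, stub_localize,
    make_stub_policy(), n_attempts=6,
)
print(len(demos))  # 4 of the 6 stub rollouts succeed
```

Passing the three stages in as callables keeps the filtering logic independent of any particular model or robot interface, so the same loop could in principle wrap either simulated or real rollouts.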