🤖 AI Summary
Existing vision-language-action (VLA) models incur high per-step inference costs, so completing a single task can take minutes, which severely hinders real-time robotic deployment. To address this, we propose RT-cache, the first trajectory memory and retrieval framework tailored for VLA models. RT-cache builds a scalable cache of multi-step motion primitives and performs scene-aware, cross-task trajectory retrieval to reuse previously successful experiences. Its core innovation is a trajectory memory pipeline that supports large-scale unsupervised experience accumulation, semantically aligned cross-scene trajectory replay, and rapid adaptation to new environments with only a few additional samples. Evaluated on real-world benchmarks including Open-X Embodiment, RT-cache reduces average task completion time by 58% and improves success rate by 12.3% over retrieval-free baselines, improving both efficiency and effectiveness.
📝 Abstract
This paper introduces RT-cache, a novel trajectory-memory pipeline that accelerates real-world robot inference by leveraging big-data retrieval and learning from experience. While modern Vision-Language-Action (VLA) models can handle diverse robotic tasks, they often incur high per-step inference costs, resulting in significant latency, sometimes minutes per task. In contrast, RT-cache stores a large-scale memory of previously successful robot trajectories and retrieves relevant multi-step motion snippets, drastically reducing inference overhead. By integrating a Memory Builder with a Trajectory Retrieval module, we develop an efficient retrieval process that remains tractable even for extremely large datasets. RT-cache flexibly accumulates real-world experiences and replays them whenever the current scene matches past states, adapting quickly to new or unseen environments with only a few additional samples. Experiments on the Open-X Embodiment Dataset and other real-world data demonstrate that RT-cache completes tasks both faster and more successfully than a baseline lacking retrieval, suggesting a practical, data-driven solution for real-time manipulation.
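The store-and-replay idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TrajectoryMemory`, the `min_similarity` threshold, the use of cosine similarity over scene embeddings, and the linear scan are all assumptions made for illustration. A system operating on extremely large datasets would presumably replace the linear scan with an indexed or approximate nearest-neighbor search.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Action = Tuple[float, ...]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class TrajectoryMemory:
    """Hypothetical memory of (scene embedding, multi-step action snippet) pairs."""

    # Assumed threshold: below it, the current scene is treated as unseen.
    min_similarity: float = 0.9
    _entries: List[Tuple[Tuple[float, ...], List[Action]]] = field(default_factory=list)

    def add(self, embedding: Vector, snippet: Sequence[Action]) -> None:
        """Memory-building step: record a snippet from a successful trajectory."""
        self._entries.append((tuple(embedding), list(snippet)))

    def retrieve(self, query: Vector) -> Optional[List[Action]]:
        """Retrieval step: return the best-matching snippet, or None on a miss."""
        best_score, best_snippet = -1.0, None
        for embedding, snippet in self._entries:  # linear scan, for clarity only
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_snippet = score, snippet
        if best_score >= self.min_similarity:
            return best_snippet
        return None


memory = TrajectoryMemory(min_similarity=0.9)
memory.add([1.0, 0.0], [(0.1, 0.0), (0.2, 0.0)])  # toy 2-D scene embeddings
memory.add([0.0, 1.0], [(0.0, 0.1)])

hit = memory.retrieve([0.9, 0.1])   # close to the first stored scene
miss = memory.retrieve([1.0, 1.0])  # not close enough to either stored scene
```

On a hit, the controller would execute the returned multi-step snippet without querying the VLA model at every step; on a miss (`None`), it would fall back to the slower model or collect a few new samples for the memory.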