🤖 AI Summary
Existing methods struggle to effectively bridge high-level language or visual prompts with low-level dexterous grasping control in cluttered scenes. To address this, we propose the first memory-augmented teacher-student learning framework that enables prompt-driven perception-action co-optimization. Our approach leverages SAM 2 for real-time object sequence detection and implicit modeling of dynamic scene states. It integrates reinforcement learning with a memory-enhanced neural network to support responsive, high-precision grasping under natural language or visual prompts at runtime. Experiments demonstrate significant improvements in cross-prompt generalization and environmental robustness, achieving state-of-the-art performance in complex, cluttered scenarios. The code and demonstration videos are publicly available.
📝 Abstract
Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM 2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at https://memory-student-teacher.github.io