Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning

📅 2025-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to link high-level prompts with fine-grained dexterous control in cluttered scenes. To address this, the paper proposes a memory-augmented student-teacher learning framework that combines promptable foundation models with reinforcement learning. Segment-Anything 2 (SAM 2) serves as the perception backbone and infers the object of interest from user prompts. Its detections are imperfect, but their temporal sequence lets memory-augmented models estimate the scene state implicitly. The learned policies are prompt-responsive and are demonstrated on picking objects from cluttered scenes. Code and demonstration videos are publicly available.

📝 Abstract
Building models responsive to input prompts represents a transformative shift in machine learning. This paradigm holds significant potential for robotics problems, such as targeted manipulation amidst clutter. In this work, we present a novel approach to combine promptable foundation models with reinforcement learning (RL), enabling robots to perform dexterous manipulation tasks in a prompt-responsive manner. Existing methods struggle to link high-level commands with fine-grained dexterous control. We address this gap with a memory-augmented student-teacher learning framework. We use the Segment-Anything 2 (SAM 2) model as a perception backbone to infer an object of interest from user prompts. While detections are imperfect, their temporal sequence provides rich information for implicit state estimation by memory-augmented models. Our approach successfully learns prompt-responsive policies, demonstrated in picking objects from cluttered scenes. Videos and code are available at https://memory-student-teacher.github.io

Problem

Research questions and friction points this paper is trying to address.

Linking high-level commands to fine-grained dexterous robot control
Enabling prompt-responsive object manipulation in cluttered environments
Compensating for imperfect object detections by exploiting their temporal sequence with memory-augmented models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines promptable foundation models with reinforcement learning
Uses a memory-augmented student-teacher learning framework
Leverages the SAM 2 model as a perception backbone to infer the object of interest from user prompts
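
The idea that a sequence of imperfect detections carries more information than any single frame can be illustrated with a toy sketch. This is not the paper's implementation: the one-dimensional object position, the noise and dropout values, the running-mean "memory", and every function name below are assumptions chosen for illustration. A privileged teacher acts on the true position, and a student with a single scalar weight is distilled to imitate it from detections alone, once from the last frame only and once from a feature aggregated over the whole sequence.

```python
import random


def simulate_detections(true_pos, steps, noise_std, drop_prob, rng):
    """Per-frame detections of a static object: noisy, and sometimes missing (None)."""
    detections = []
    for _ in range(steps):
        if rng.random() < drop_prob:
            detections.append(None)  # detector lost the object in this frame
        else:
            detections.append(true_pos + rng.gauss(0.0, noise_std))
    return detections


def teacher_action(true_pos):
    """Privileged teacher: acts on the ground-truth position (simulation only)."""
    return true_pos


def last_frame_feature(detections):
    """Memoryless student input: only the newest frame, 0.0 if it is missing."""
    last = detections[-1]
    return 0.0 if last is None else last


def memory_feature(detections):
    """Student input with memory: running mean of all valid detections so far.

    A hand-written stand-in (assumption) for a learned recurrent state.
    """
    count, mean = 0, 0.0
    for d in detections:
        if d is not None:
            count += 1
            mean += (d - mean) / count
    return mean


def sample_episode(rng):
    """Draw a random object position and its imperfect detection sequence."""
    true_pos = rng.uniform(0.5, 1.5)
    detections = simulate_detections(
        true_pos, steps=20, noise_std=0.2, drop_prob=0.3, rng=rng
    )
    return true_pos, detections


def distill(feature_fn, rng, episodes=2000, lr=0.05):
    """Fit a scalar readout w by SGD so that w * feature imitates the teacher."""
    w = 0.0
    for _ in range(episodes):
        true_pos, detections = sample_episode(rng)
        x = feature_fn(detections)
        error = w * x - teacher_action(true_pos)
        w -= lr * error * x  # gradient of 0.5 * error**2 with respect to w
    return w


def imitation_error(feature_fn, w, rng, episodes=500):
    """Mean absolute gap between student and teacher actions on fresh episodes."""
    total = 0.0
    for _ in range(episodes):
        true_pos, detections = sample_episode(rng)
        total += abs(w * feature_fn(detections) - teacher_action(true_pos))
    return total / episodes


if __name__ == "__main__":
    rng = random.Random(0)
    for name, fn in [("last frame", last_frame_feature), ("with memory", memory_feature)]:
        w = distill(fn, rng)
        print(name, round(imitation_error(fn, w, rng), 3))
```

In this toy setting the student that aggregates over time should track the teacher more closely, because dropped frames and per-frame noise average out. A learned recurrent or otherwise memory-augmented network would play the role of `memory_feature` in a realistic system, and the teacher would be a policy trained with RL on privileged simulator state rather than the identity function used here.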