General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

📅 2025-10-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing goal representations, such as target state images, 3D coordinates, and one-hot vectors, suffer from poor generalization to unseen objects, slow convergence, and the need for special cameras. Method: This paper introduces object-agnostic binary masks as a visual goal representation for goal-conditioned reinforcement learning (GCRL). Masks can be processed into dense rewards without distance calculations or positional information about the target. Training uses ground-truth masks in simulation, and pretrained open-vocabulary object detection models generate the masks on real robots. A single policy covers diverse objectives. Contribution/Results: In simulation, the approach achieves 99.9% reaching accuracy on both training objects and unseen test objects. It also performs pick-up tasks with high accuracy, and the authors demonstrate learning from scratch and sim-to-real transfer on two different physical robots.

📝 Abstract

Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
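The abstract states that masks can be processed into dense rewards without distance calculations, but it does not give the reward formula. The following is only a minimal sketch of one way such a reward could look, not the authors' method: it assumes the reward is the mask's overlap with a fixed center-weighted window, so it grows as the target fills more of the camera view and sits closer to the image center. The names `center_weights` and `mask_reward` are hypothetical.

```python
import numpy as np


def center_weights(height: int, width: int) -> np.ndarray:
    """Fixed weight map: largest near the image center, 0.0 at the borders.

    Assumption for illustration only; the source does not specify a weighting.
    """
    wy = 1.0 - np.abs(np.linspace(-1.0, 1.0, height))
    wx = 1.0 - np.abs(np.linspace(-1.0, 1.0, width))
    return np.outer(wy, wx)


def mask_reward(mask: np.ndarray) -> float:
    """Hypothetical dense reward in [0, 1] computed from a binary target mask.

    No 3D position or metric distance to the target is used: the reward is
    the weighted fraction of the image covered by the mask.
    """
    mask = np.asarray(mask, dtype=bool)
    weights = center_weights(*mask.shape)
    return float((weights * mask).sum() / weights.sum())


if __name__ == "__main__":
    far = np.zeros((64, 64), dtype=np.uint8)
    far[2:8, 2:8] = 1          # small target near a corner
    near = np.zeros((64, 64), dtype=np.uint8)
    near[16:48, 16:48] = 1     # large, centered target
    print(mask_reward(far), mask_reward(near))
```

Under these assumptions an empty mask (target not visible) yields zero reward, and the signal rises smoothly as the target becomes larger and more central in the view.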

Problem

Research questions and friction points this paper is trying to address.

Developing object-agnostic visual goal representations for reinforcement learning

Addressing poor generalization and slow convergence in goal-conditioned RL

Enabling efficient learning without positional information or special cameras

Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-agnostic mask representation for visual goals

Dense reward generation without distance calculations

Sim-to-real transfer using pretrained detection models
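To make the mask-as-goal idea above concrete, here is a hedged sketch of how a detector's output might be turned into a goal input for a policy. It assumes the detection model returns pixel-space bounding boxes as `(x_min, y_min, x_max, y_max)` and that the mask is appended to the RGB image as an extra channel. Both are illustrative assumptions rather than details confirmed by the source, and `boxes_to_mask` and `goal_conditioned_obs` are hypothetical helper names.

```python
import numpy as np


def boxes_to_mask(boxes, height: int, width: int) -> np.ndarray:
    """Rasterize pixel-space boxes (x_min, y_min, x_max, y_max) into one binary mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image so out-of-frame detections do not raise errors.
        x0, y0 = max(int(x0), 0), max(int(y0), 0)
        x1, y1 = min(int(x1), width), min(int(y1), height)
        if x1 > x0 and y1 > y0:
            mask[y0:y1, x0:x1] = 1
    return mask


def goal_conditioned_obs(rgb: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Append the mask as a fourth channel: (H, W, 3) uint8 -> (H, W, 4) float32."""
    rgb = rgb.astype(np.float32) / 255.0
    goal = mask[..., None].astype(np.float32)
    return np.concatenate([rgb, goal], axis=-1)


if __name__ == "__main__":
    frame = np.zeros((64, 64, 3), dtype=np.uint8)   # placeholder camera image
    mask = boxes_to_mask([(10, 20, 30, 40)], 64, 64)
    obs = goal_conditioned_obs(frame, mask)
    print(obs.shape, int(mask.sum()))
```

Because the mask carries no object identity, the same observation layout would apply to any target the detector can localize, which is the sense in which the representation is object-agnostic.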

🔎 Similar Papers

Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

2023-06-06, International Conference on Learning Representations, Citations: 4