🤖 AI Summary
Current task-oriented grasping (TOG) research faces three key bottlenecks: scarcity of high-quality real-world datasets, incomplete annotations (e.g., masks or poses only), and the absence of generalization evaluation on unseen object subcategories. To address these, we introduce TD-TOG, a real-world TOG benchmark comprising 1,449 RGB-D scenes spanning 30 broad categories and 120 fine-grained subcategories. It provides three complementary annotation types: hand-labeled object masks, functional regions, and planar rectangular grasp poses. We also establish a novel subcategory generalization challenge and propose Binary-TOG, a framework that integrates zero-shot, text-driven object recognition (via CLIP) with one-shot functional segmentation, enabling cross-subcategory generalization without retraining. In multi-object scenes, Binary-TOG achieves 68.9% task-oriented grasp accuracy, making TD-TOG a high-fidelity, real-world benchmark supporting both zero-shot and one-shot TOG evaluation.
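The summary describes zero-shot, text-driven object recognition: a segmented region is selected by comparing its CLIP-style image embedding against the embedding of a text prompt, so no visual reference of the target is needed. The paper's exact pipeline is not given here; the sketch below illustrates only the core matching step, with random mock vectors standing in for real CLIP features (the function and variable names are hypothetical, not from the paper).

```python
import numpy as np

def zero_shot_select(text_emb, region_embs):
    """Pick the segmented region whose (CLIP-style) embedding has the
    highest cosine similarity to the text-prompt embedding.
    Returns the winning region index and all similarity scores."""
    t = text_emb / np.linalg.norm(text_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ t  # cosine similarity per region
    return int(np.argmax(sims)), sims

# Mock 512-d embeddings standing in for CLIP features (hypothetical data).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
region_embs = rng.normal(size=(4, 512))
region_embs[2] = text_emb + 0.1 * rng.normal(size=512)  # region 2 matches the prompt

idx, sims = zero_shot_select(text_emb, region_embs)
print(idx)  # region 2 is closest to the prompt
```

In a real system the mock vectors would be replaced by CLIP's text encoder output for the prompt and its image encoder output for each masked region crop.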
📝 Abstract
Task-oriented grasping (TOG) is an essential preliminary step for robotic task execution: it involves predicting grasps on the regions of target objects that facilitate the intended task. Despite high demand, existing TOG datasets for training and benchmarking remain scarce, and those that exist are often synthetic or contain mask-annotation artifacts that hinder model performance. Moreover, TOG solutions often require affordance masks, grasps, and object masks for training; however, existing datasets typically provide only a subset of these annotations. To address these limitations, we introduce the Top-down Task-oriented Grasping (TD-TOG) dataset, designed to train and evaluate TOG solutions. TD-TOG comprises 1,449 real-world RGB-D scenes spanning 30 object categories and 120 subcategories, with hand-annotated object masks, affordances, and planar rectangular grasps. It also features a test set for a novel challenge that assesses a TOG solution's ability to distinguish between object subcategories. To meet the demand for TOG solutions that can adapt to and manipulate previously unseen objects without retraining, we propose a novel TOG framework, Binary-TOG. Binary-TOG uses zero-shot learning for object recognition and one-shot learning for affordance recognition. Zero-shot learning enables Binary-TOG to identify objects in multi-object scenes from textual prompts alone, eliminating the need for visual references. In multi-object settings, Binary-TOG achieves an average task-oriented grasp accuracy of 68.9%. Lastly, this paper contributes a comparative analysis of one-shot versus zero-shot learning for object generalization in TOG, to inform the development of future TOG solutions.
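The abstract annotates grasps as planar rectangles. A common convention for such grasps (e.g., Cornell-style rectangles) is a center, a gripper opening width, a finger width, and an in-plane rotation; the sketch below converts that 5-tuple into corner points. This parameterization is an assumption for illustration, since the abstract does not specify TD-TOG's exact encoding.

```python
import math

def grasp_rect_corners(x, y, w, h, theta):
    """Return the four corners of a planar rectangular grasp given as
    center (x, y), gripper opening w, finger width h, and in-plane
    rotation theta in radians. Assumes a Cornell-style (x, y, w, h,
    theta) convention -- TD-TOG's actual encoding may differ."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        # Rotate the local offset by theta, then translate to the center.
        corners.append((x + dx * c - dy * s, y + dx * s + dy * c))
    return corners

print(grasp_rect_corners(0.0, 0.0, 2.0, 1.0, 0.0))
# axis-aligned case: [(-1.0, -0.5), (1.0, -0.5), (1.0, 0.5), (-1.0, 0.5)]
```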