🤖 AI Summary
This work addresses zero-shot 6D object pose estimation for robotic manipulation when visuotactile training data are scarce. We propose an online optimization framework that requires no visuotactile training data. Methodologically, we model the gripper-object interaction as a spring-mass physical system and integrate DINOv2-derived visual priors with proprioceptive inputs from tactile sensors and joint encoders. During inference, tactile attraction forces and proprioceptive repulsion forces jointly constrain the optimization, enabling real-time gradient-based pose refinement. This zero-shot paradigm substantially improves pose estimation accuracy: real-robot experiments demonstrate a 55% increase in ADD-S AUC, a 60% improvement in ADD, and an 80% reduction in positional error, markedly improving the robustness of in-hand object tracking during grasping.
📝 Abstract
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle to generalize due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as the backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, in which tactile sensors induce attractive forces and proprioception generates repulsive forces. We validate our framework on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to purely visual models, our approach avoids several drastic failure modes while tracking the in-hand object pose. In our experiments, it achieves an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error, compared to FoundationPose.