ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the poor robustness and limited generalization of image-based visual servoing (IBVS) under occlusion and environmental variations, this paper proposes a task- and object-agnostic visual servoing method. The core innovation lies in the first integration of semantic features from a pre-trained Vision Transformer (ViT) into the IBVS framework, enabling zero-shot cross-object and cross-scene generalization via joint image feature matching and Jacobian estimation. Evaluated in sim-to-real transfer, the method achieves full convergence under nominal conditions; under disturbances, it reduces positioning error by 31.2% compared to classical IBVS while matching the convergence rate of supervised learning methods. Real-world experiments demonstrate applicability to industrial bin-picking and grasping of unseen objects, requiring only category-level reference images. This work bridges classical IBVS and learning-based approaches, significantly enhancing robustness and generalization without task-specific training.

📝 Abstract
Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: https://alessandroscherl.github.io/ViT-VS/
Problem

Research questions and friction points this paper is trying to address.

Improves robot end-effector positioning using pretrained vision transformers.
Enhances robustness in visual servoing without task-specific training.
Generalizes to unseen objects using semantic feature extraction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pretrained vision transformers for feature extraction.
Combines classical and learning-based visual servoing advantages.
Generalizes beyond training samples without task-specific training.
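The approach couples ViT feature matching with the classical IBVS control law, in which the camera velocity is computed from the image-feature error as v = -λ L⁺ (s - s*). As a minimal sketch of that classical law only (not the authors' implementation; function names, the gain `lam`, and the assumption of known per-point depths are illustrative), the velocity command for N matched point features can be computed like this:

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction (image Jacobian) matrix for one normalized image point.

    Maps the 6-DoF camera velocity (vx, vy, vz, wx, wy, wz) to the
    image-plane velocity of the point; Z is the point's depth estimate.
    """
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x**2), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y**2, -x * y, -x],
    ])

def ibvs_velocity(features, desired, depths, lam=0.5):
    """Classical IBVS law: v = -lam * L^+ (s - s*).

    features, desired: (N, 2) arrays of current / goal image points
    (e.g. matched ViT feature locations); depths: N depth estimates.
    """
    # Stack one 2x6 interaction matrix per matched point -> (2N, 6).
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    error = (features - desired).reshape(-1)
    # Pseudo-inverse handles the over-determined system (N >= 3 points).
    return -lam * np.linalg.pinv(L) @ error
```

When the matched features coincide with the goal configuration the error is zero and the commanded velocity vanishes, which is the convergence condition the paper's experiments measure.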
Alessandro Scherl
Department of Computer Technology, University of Alicante, Spain; Industrial Engineering Department, UAS Technikum Vienna, Austria
Stefan Thalhammer
UAS Technikum Vienna
Computer Vision · Robotics · Machine Learning
Bernhard Neuberger
Industrial Engineering Department, UAS Technikum Vienna, Austria
Wilfried Wöber
Industrial Engineering Department, UAS Technikum Vienna, Austria
José García-Rodríguez
Full Professor of AI & HPC, University of Alicante (Spain)
Machine/Deep Learning · Computer Vision · Robotics · HPC · AAL