🤖 AI Summary
Current vision foundation models (VFMs) exhibit limited generalization in robotic learning: single-model approaches suffer from strong domain specificity, while multi-model distillation incurs rigid feature selection and high costs for domain-knowledge integration. To address this, we propose a dynamic expert routing framework that constructs a visual expert library via multi-VFM distillation and introduces a lightweight patchwise routing network coupled with a curriculum-based Top-K annealing strategy for task-adaptive expert selection. Our method enables robot-prior injection via minimal parameter fine-tuning and supports scalable expert integration. Built upon the ViT architecture, it unifies multi-model knowledge distillation, fine-grained routing, and parameter-efficient adaptation. Evaluated on 17 diverse robotic tasks, it achieves state-of-the-art performance, significantly improving focus on critical regions, cross-task generalization, and robustness to environmental variations.
📝 Abstract
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy learning can mitigate this limitation, but it often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and the precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code are available at https://yixiaowang7.github.io/ver_page/.
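To make the routing mechanism concrete, here is a minimal NumPy sketch of patchwise Top-K expert routing with a curriculum annealing schedule. The function names (`topk_route`, `anneal_k`) and the linear annealing schedule are illustrative assumptions, not the paper's exact implementation; the idea shown is that each image patch gets its own softmax mixture over experts, restricted to the K highest-scoring experts, with K shrinking over training.

```python
import numpy as np

def topk_route(logits, k):
    """Patchwise Top-K routing (illustrative sketch).

    logits: (num_patches, num_experts) router scores for one image.
    Returns per-patch mixture weights: the k largest experts per patch
    keep their (renormalized) softmax mass; all others are zeroed.
    """
    # Indices of the k largest logits for each patch.
    top_idx = np.argsort(logits, axis=-1)[:, -k:]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    # Numerically stable softmax restricted to the selected experts.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    exp = np.where(mask, exp, 0.0)
    return exp / exp.sum(axis=-1, keepdims=True)

def anneal_k(step, total_steps, k_start, k_end):
    """Curriculum Top-K annealing (assumed linear schedule): begin with a
    broad mixture of k_start experts per patch, narrow toward k_end."""
    frac = min(step / total_steps, 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))
```

Given per-expert patch features `expert_feats` of shape `(num_patches, num_experts, dim)`, the routed representation would be the weighted sum `(weights[..., None] * expert_feats).sum(axis=1)`, so only the selected experts contribute to each patch.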