VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing


Abstract

Pretrained vision foundation models (VFMs) advance robot learning with rich visual representations, yet individual VFMs typically excel only in specific domains, limiting their generality across tasks. Distilling multiple VFMs into a unified representation can mitigate this limitation, but it often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. For downstream robot tasks, we then fine-tune only a lightweight routing network (fewer than 0.4% of the parameters) that dynamically selects task-relevant experts from the pretrained library. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Code and checkpoints are available in the supplementary materials.
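To make the routing mechanism concrete, the following is a minimal PyTorch-style sketch of patchwise expert routing with curriculum Top-K annealing. All module names and hyperparameters (e.g., `k_start`, `k_end`, `anneal_steps`) are illustrative assumptions rather than the released implementation: each patch token scores every expert in the library, only the Top-K experts are kept, and K is annealed from a larger to a smaller value over training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchwiseExpertRouter(nn.Module):
    """Illustrative sketch: for each image patch token, score all experts in
    the library, keep only the Top-K, and anneal K from k_start down to
    k_end over training (curriculum Top-K annealing)."""

    def __init__(self, dim, num_experts, k_start=8, k_end=2, anneal_steps=10_000):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # lightweight routing network
        self.k_start, self.k_end = k_start, k_end
        self.anneal_steps = anneal_steps

    def current_k(self, step):
        # Linearly anneal K from k_start to k_end over anneal_steps.
        frac = min(step / self.anneal_steps, 1.0)
        return max(self.k_end, round(self.k_start - frac * (self.k_start - self.k_end)))

    def forward(self, patch_tokens, expert_outputs, step):
        # patch_tokens:   (B, N, D)    patch features from the base transformer
        # expert_outputs: (B, N, E, D) per-patch features from each frozen expert
        B, N, E, D = expert_outputs.shape
        k = self.current_k(step)
        logits = self.gate(patch_tokens)                   # (B, N, E)
        topk_val, topk_idx = logits.topk(k, dim=-1)        # keep K experts per patch
        weights = F.softmax(topk_val, dim=-1)              # renormalize over selected experts
        selected = torch.gather(
            expert_outputs, 2,
            topk_idx.unsqueeze(-1).expand(B, N, k, D))     # (B, N, k, D)
        return (weights.unsqueeze(-1) * selected).sum(dim=2)  # (B, N, D) fused feature
```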


Overall structure of VER. VER comprises two key components: the Base Vision Transformer (BVT), which processes images into unified representations, and the Vision Expert Library (VEL), which stores a diverse set of specialized vision experts and selectively utilizes them to mimic the teacher vision foundation models and enhance performance on downstream robotic tasks. Our framework consists of two phases: (1) Pretraining, where we distill multiple foundation models (DINOv2, ViT, CLIP) into VER; and (2) Downstream Robotic Tasks, where we freeze the experts and train a lightweight Robot Router (<0.4% of parameters) that dynamically selects task-relevant visual features to guide the policy head in generating appropriate robotic actions. This two-stage approach enables efficient knowledge distillation from diverse vision foundation models and adaptive feature selection for robotic tasks.
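The two training phases described above can be summarized with the brief sketch below. The cosine distillation objective and the `robot_router` attribute are assumptions for illustration; the paper's exact losses and module names may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_features, teacher_features):
    """Phase 1 (pretraining) sketch: align each expert's output with the
    patch features of its corresponding teacher VFM (e.g., DINOv2, ViT,
    CLIP). The cosine form here is an assumption, not the paper's exact
    objective."""
    loss = 0.0
    for name, s in student_features.items():
        t = teacher_features[name].detach()        # teacher models stay frozen
        loss = loss + (1 - F.cosine_similarity(s, t, dim=-1)).mean()
    return loss

def configure_downstream(model):
    """Phase 2 (downstream robot tasks) sketch: freeze the base transformer
    and the expert library, and train only the lightweight Robot Router
    (<0.4% of parameters) together with the policy head."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.robot_router.parameters():      # hypothetical attribute name
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]
```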

Patch Feature Visualization

Patch feature visualization on the pen task across 10 training random seeds.

Patch feature visualization on the relocate task across 10 training random seeds.

Patch feature visualization on the pick-and-place task. We compare Theia (left), VER before expert selection (middle), and VER after expert selection (right). After expert selection, VER concentrates on task-relevant objects and suppresses robot-related and background patches.