Primitive Subspaces Mediate Few-Shot Transfer in VLAs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cost of adapting vision-language-action (VLA) policies to new industrial tasks, which currently requires fine-tuning for each task. The authors study primitive-aware training, which yields a library of composable sub-skills that can be combined at inference time, conditioned on a few demonstrations, to perform unseen tasks without updating model parameters. By ablating the primitive-decodable subspace of the hidden states, they show that primitive representations are causally necessary for few-shot transfer rather than merely correlated with it. They also identify and correct a methodological pitfall in evaluating chunked policies: single-step action-range gates inflate false-failure rates family-wise. Evaluated on the REASSEMBLE and LIBERO-Long datasets with the OpenVLA and π₀.₅ architectures under matched LoRA fine-tuning, primitive-trained models reach 78% of the fine-tuned upper-bound performance with only three demonstrations, whereas flat-trained models need ten, a roughly threefold gap in sample efficiency that replicates across seeds, architectures, and datasets.
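
The action-range gating issue mentioned above can be illustrated with a small calculation. The sketch below is not the authors' evaluation code; it assumes, purely for illustration, that each step of an action chunk is checked by an independent gate with the same false-alarm rate `alpha`, and that a chunk is marked as failed if any single gate fires. The function names, `alpha`, and `horizon` are hypothetical.

```python
def family_wise_false_failure(alpha: float, horizon: int) -> float:
    """Chance that at least one of `horizon` independent gates fires falsely.

    Assumption (not from the paper): gates are independent and share the
    same per-step false-alarm rate `alpha`.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return 1.0 - (1.0 - alpha) ** horizon


def per_step_alpha_for_target(target: float, horizon: int) -> float:
    """Per-step rate that keeps the chunk-level false-failure rate at `target`.

    This is the Sidak-style inversion of the formula above, shown as one
    possible correction; the paper's actual correction may differ.
    """
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return 1.0 - (1.0 - target) ** (1.0 / horizon)
```

Under these assumptions, a 1% per-step false-alarm rate applied over a 50-step chunk, `family_wise_false_failure(0.01, 50)`, already gives a chunk-level false-failure rate of roughly 39%, which is how a per-step gate can reject even ground-truth human demonstrations far more often than its nominal rate suggests.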

📝 Abstract

Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $\pi_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only $m=3$ demonstrations, while flat-trained models require $m=10$ demonstrations to reach the same level, a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show that few-shot transfer degrades by 32 percentage points, while ablating a random subspace of equal dimensionality has no effect, indicating that primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
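
The subspace ablation and its random control described in the abstract can be sketched as a linear projection. This is a minimal illustration, not the authors' procedure: the abstract does not say how the primitive-decodable subspace is found, so the sketch assumes it is spanned by the columns of a hypothetical linear-probe weight matrix `probe_weights`, and all function names are illustrative.

```python
import numpy as np


def orthonormal_basis(directions: np.ndarray) -> np.ndarray:
    """Return an orthonormal basis (d x k) for the span of `directions` (d x k)."""
    q, _ = np.linalg.qr(directions)  # reduced QR: q has orthonormal columns
    return q


def ablate_subspace(hidden: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state (rows of an n x d array)
    that lies in the subspace spanned by the orthonormal `basis` (d x k)."""
    return hidden - (hidden @ basis) @ basis.T


def random_basis(dim: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Control condition: a random k-dimensional subspace of the same size."""
    return orthonormal_basis(rng.standard_normal((dim, k)))


rng = np.random.default_rng(0)
hidden = rng.standard_normal((32, 16))         # toy hidden states (n=32, d=16)
probe_weights = rng.standard_normal((16, 4))   # hypothetical probe directions

primitive_basis = orthonormal_basis(probe_weights)
ablated = ablate_subspace(hidden, primitive_basis)
control = ablate_subspace(hidden, random_basis(16, 4, rng))
```

After the projection, `ablated` carries no component along any probe direction, while `control` removes the same number of dimensions in a random orientation. Comparing downstream few-shot performance with `ablated` versus `control` hidden states is the kind of contrast the abstract reports (a 32-point drop versus no effect).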

Problem

Research questions and friction points this paper is trying to address.

few-shot transfer

vision-language-action

primitive subspaces

task generalization

sample efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

primitive subspaces

few-shot transfer

vision-language-action (VLA)