FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses visual representation collapse in vision-language-action (VLA) policies during action-supervised fine-tuning: the action loss constrains only the feature directions that change predicted actions, so visual structure shared across action-equivalent states is free to collapse. The paper formalizes this as residual visual collapse along local action fibers and introduces FiberTune, a training-time objective that uses an online action probe to estimate action-predictive feature directions and filters them from intermediate visual-token representations. The remaining probe-filtered residuals are aligned with a frozen visual teacher and regularized for effective rank, with no added inference-time overhead. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in all six controlled simulation settings (two benchmarks, two architectures) and on physical SO-101 pick-place, including a gain of 10.7 percentage points in SR(5) on long-horizon CALVIN ABC-to-D and an increase in SO-101 task success from 72.7% to 78.1%.
📝 Abstract
Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
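The pipeline described in the abstract (probe, filter, align, regularize) can be illustrated with a small NumPy sketch. This is not the paper's implementation: the ridge-regression probe, the QR-based projection, the cosine alignment loss, the entropy-based effective rank, the 0.01 weight, and every function name below are assumptions chosen for illustration. The sketch also assumes student and teacher features share a dimension and applies the same filter to both.

```python
import numpy as np

def fit_linear_probe(feats, actions, ridge=1e-3):
    """Hypothetical action probe: ridge regression W (d x a) from features to actions."""
    d = feats.shape[1]
    gram = feats.T @ feats + ridge * np.eye(d)
    return np.linalg.solve(gram, feats.T @ actions)

def filter_action_directions(feats, probe_w):
    """Project out the subspace spanned by the probe's weight columns."""
    q, _ = np.linalg.qr(probe_w)  # orthonormal basis of shape (d, a)
    return feats - (feats @ q) @ q.T

def effective_rank(x, eps=1e-12):
    """exp(entropy) of the normalized singular values of the centered matrix."""
    s = np.linalg.svd(x - x.mean(axis=0), compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

def alignment_loss(residual, teacher_residual, eps=1e-8):
    """1 minus the mean cosine similarity between student and teacher residuals."""
    num = (residual * teacher_residual).sum(axis=1)
    den = (np.linalg.norm(residual, axis=1)
           * np.linalg.norm(teacher_residual, axis=1) + eps)
    return float(1.0 - (num / den).mean())

# Synthetic stand-ins for visual-token features, teacher features, and actions.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 32))
teacher = feats + 0.1 * rng.normal(size=(256, 32))
actions = feats[:, :8] @ rng.normal(size=(8, 7))

w = fit_linear_probe(feats, actions)
residual = filter_action_directions(feats, w)
teacher_residual = filter_action_directions(teacher, w)  # assumption: same filter on teacher

align = alignment_loss(residual, teacher_residual)
erank = effective_rank(residual)
# Assumed combination: minimize misalignment while rewarding higher effective rank.
aux_loss = align - 0.01 * erank
```

In this toy setup the filtered residual carries no linear information about the actions (its product with the probe weights is numerically zero), so the auxiliary terms act only on directions the action loss leaves unconstrained, which is the motivation the abstract gives.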
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
visual collapse
action fibers
fine-tuning
visual residuals
Innovation

Methods, ideas, or system contributions that make the work stand out.

action-fiber residuals
vision-language-action fine-tuning
visual representation preservation
online action probing
effective rank regularization