STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-horizon robotic manipulation methods model action sequences analogously to language modeling and rely on trajectory-level reinforcement learning (e.g., TPO/PPO), resulting in coarse-grained credit assignment and training instability. This work observes that robotic actions progress through causally chained, semantically distinct stages of heterogeneous difficulty, which calls for stage-aware optimization. To address this, we propose STARE (Stage-Aware Reinforcement), a framework comprising three key components: (1) semantic decomposition of action trajectories into interpretable stages; (2) stage-aligned dense reward shaping; and (3) the IPI progressive fine-tuning pipeline (Imitation → Preference → Interaction). Building upon STARE, we derive two concrete algorithms: STA-TPO for offline stage-wise preference optimization and STA-PPO for online intra-stage interaction optimization. Evaluated on SimplerEnv and ManiSkill3, STARE achieves 98.0% and 96.4% task success rates, respectively, substantially outperforming prior approaches.
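The decomposition and reward shaping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stage names, boundary indices, and per-stage weights are hypothetical placeholders (e.g., boundaries might come from detected gripper events in practice).

```python
# Hypothetical sketch of stage-aware reward shaping: split a trajectory
# into semantic stages, then assign credit at stage granularity instead
# of to the whole trajectory. Names/weights are illustrative only.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str      # semantic label, e.g. "reach", "grasp", "place"
    start: int     # inclusive trajectory index
    end: int       # exclusive trajectory index
    weight: float  # per-stage difficulty weight (assumed, not from paper)

def decompose(traj_len, boundaries, names, weights):
    """Split a trajectory of length traj_len into stages at the given
    boundary indices (e.g., detected from semantic events)."""
    cuts = [0] + list(boundaries) + [traj_len]
    return [Stage(n, s, e, w)
            for n, s, e, w in zip(names, cuts[:-1], cuts[1:], weights)]

def stage_aligned_rewards(step_rewards, stages):
    """Aggregate per-step rewards into one dense signal per stage, so each
    stage receives its own credit rather than a single trajectory score."""
    return {st.name: st.weight * sum(step_rewards[st.start:st.end])
            for st in stages}
```

With this shape, a weak stage (say, grasping) can be upweighted and optimized independently of stages the policy already performs well.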

📝 Abstract
Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. To this end, we present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO, we obtain Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference and online intra-stage interaction, respectively. Further building on supervised fine-tuning as initialization, we propose Imitation → Preference → Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0% on SimplerEnv and 96.4% on ManiSkill3 tasks.
Problem

Research questions and friction points this paper is trying to address.

Optimizes long-horizon robotic manipulation by decomposing action trajectories into stages
Provides dense, stage-aligned reinforcement signals to address coarse credit assignment
Improves action accuracy in Vision-Language-Action models via a progressive fine-tuning pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-aware reinforcement decomposes trajectories into semantic stages.
Integrates stage-aware signals into TPO and PPO for optimization.
Uses the Imitation → Preference → Interaction (IPI) pipeline to improve action accuracy.
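The serial order of the pipeline can be sketched as below. This is a minimal illustration of the sequencing only; the three phase functions are hypothetical placeholders standing in for supervised fine-tuning, offline STA-TPO, and online STA-PPO, not the paper's training code.

```python
# Minimal sketch of the serial IPI (Imitation → Preference → Interaction)
# fine-tuning order. Each phase initializes from the previous phase's
# result; the phase callables here are assumed placeholders.
def ipi_pipeline(policy, imitation_fn, preference_fn, interaction_fn):
    """Run the three fine-tuning phases strictly in series."""
    policy = imitation_fn(policy)    # supervised fine-tuning as initialization
    policy = preference_fn(policy)   # offline stage-wise preference (STA-TPO)
    policy = interaction_fn(policy)  # online intra-stage interaction (STA-PPO)
    return policy
```

The design point is that each phase consumes the checkpoint produced by the previous one, so the cheaper offline phases shape the policy before any online interaction is spent.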