Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action (VLA) models suffer from high computational overhead due to redundant visual tokens, severely hindering real-time robotic deployment. Existing task-agnostic pruning methods struggle to jointly preserve global semantic coherence and fine-grained spatial detail. To address this, the authors propose Compressor-VLA, an instruction-guided dual-path visual token compression framework. Its core contribution is a natural-language-instruction-modulated Semantic Task Compressor paired with a Spatial Refinement Compressor, which respectively model task-relevant semantic context and critical spatial structure, enabling dynamic, adaptive condensation of visual information. Evaluated on the LIBERO benchmark, Compressor-VLA achieves competitive performance while reducing FLOPs by 59% and the visual token count by over 3x, significantly improving the inference efficiency and deployment feasibility of VLA models on real-world robotic platforms.

📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. While standard token pruning techniques can alleviate this, such task-agnostic methods struggle to preserve task-critical visual information. To address this challenge while preserving both holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The proposed Compressor-VLA framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that preserves fine-grained spatial details. This compression is dynamically modulated by the natural language instruction, allowing for the adaptive condensation of task-relevant visual information. Extensive evaluations demonstrate that Compressor-VLA achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline. Real-robot deployments on a dual-arm platform validate the model's sim-to-real transferability and practical applicability. Moreover, qualitative analyses reveal that instruction guidance dynamically steers the model's perceptual focus toward task-relevant objects, validating the effectiveness of the approach.
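The abstract's core mechanism, instruction-conditioned compression of many visual tokens into a small learned set, can be sketched as cross-attention pooling with learned queries over instruction-modulated visual features. The sketch below is a minimal NumPy illustration under assumed shapes; the FiLM-style modulation, the token counts (`n_visual`, `n_query`), and all variable names are assumptions for illustration, not the paper's actual STC/SRC implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Assumed sizes (illustrative only): 256 visual tokens -> 64 compressed tokens.
n_visual, n_query, d = 256, 64, 32

visual_tokens = rng.standard_normal((n_visual, d))   # patch features from the vision encoder
instr_embed   = rng.standard_normal(d)               # pooled instruction embedding
queries       = rng.standard_normal((n_query, d))    # learned compression queries

# Instruction-conditioned modulation of visual features (FiLM-style; an assumption).
gamma, beta = np.tanh(instr_embed), 0.1 * instr_embed
modulated = visual_tokens * gamma + beta

# Cross-attention pooling: each learned query attends over the modulated visual tokens,
# so the output is a short, instruction-dependent summary of the image.
attn = softmax(queries @ modulated.T / np.sqrt(d))   # (n_query, n_visual)
compressed = attn @ modulated                        # (n_query, d)

print(compressed.shape)  # (64, 32)
```

In a trained model the queries and modulation parameters would be learned end-to-end, so different instructions redistribute the attention weights and hence which image regions survive compression.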
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in Vision-Language-Action models for robots
Preserves task-critical visual information through instruction-guided compression
Enables efficient real-time robotic manipulation with sim-to-real transferability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-conditioned hybrid token compression framework
Semantic Task Compressor preserves task-relevant context
Spatial Refinement Compressor maintains fine-grained spatial details
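Why does a >3x reduction in visual tokens translate into roughly halved FLOPs rather than a 3x FLOPs cut? Text tokens are untouched, and per-layer cost has both a linear and a quadratic term in sequence length. The back-of-envelope estimate below uses the standard rough formula (about 12*n*d^2 for projections/MLP plus 2*n^2*d for attention); the token budgets and model width are illustrative assumptions, not the paper's accounting.

```python
# Rough per-layer transformer FLOPs: projections/MLP scale linearly in token
# count, self-attention quadratically (standard estimate, not from the paper).
def layer_flops(n_tokens, d_model):
    return 12 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model

# Illustrative budgets (assumptions): 32 text tokens kept, visual tokens
# compressed from 256 to 80 (i.e. more than 3x fewer).
n_text, d = 32, 1024
before = layer_flops(n_text + 256, d)
after  = layer_flops(n_text + 80, d)

reduction = 1 - after / before
print(f"FLOPs reduction: {reduction:.0%}")  # → FLOPs reduction: 62%
```

With these made-up budgets the estimate lands in the same regime as the paper's reported 59%; the exact figure depends on the real token counts, model width, and the compressor's own overhead.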
Juntao Gao
Nara Institute of Science and Technology
Stochastic Network Optimization · Machine Learning · Intelligent Transportation Systems
Feiyang Ye
University of Technology Sydney, Ph.D. student
Multi-Task Learning
Jing Zhang
School of Information Science and Technology, Beijing University of Technology
Wenjing Qian
School of Information Science and Technology, Beijing University of Technology