Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large vision-language models, which stems from processing a vast number of visual tokens. Existing compression approaches often rely on handcrafted heuristics and fail to model the multi-step sequential decision-making inherent in token selection. To overcome this limitation, the paper introduces language-guided reinforcement learning for visual token compression, formulating it as a sequential decision problem. The authors propose a self-supervised encoder to construct an efficient state representation and jointly optimize task performance and computational efficiency through a combination of imitation learning and Proximal Policy Optimization (PPO). This approach enables adaptive, multi-step pruning tailored to downstream tasks, achieving up to 66.7% visual token removal at inference time, a 54.2% reduction in FLOPs, and only a 0.7% average drop in accuracy—significantly outperforming conventional static or heuristic compression strategies.
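The summary mentions a self-supervised encoder that compresses the visual-token set into a compact state for the pruning policy. A minimal sketch of that idea follows, using fixed random projections in place of learned weights; the dimensions, the mean-pooling step, and the function name `encode_state` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
token_dim, state_dim, n_tokens = 64, 8, 576  # assumed sizes

# Stand-ins for learned encoder/decoder weights of an autoencoder.
W_enc = rng.normal(size=(token_dim, state_dim)) * 0.1
W_dec = rng.normal(size=(state_dim, token_dim)) * 0.1

def encode_state(tokens):
    """Compress a set of visual tokens into a compact RL state.

    Mean-pool the token set (a permutation-invariant summary), project it
    to a low-dimensional state, and compute a self-supervised
    reconstruction loss that would train the autoencoder.
    """
    pooled = tokens.mean(axis=0)                  # (token_dim,)
    state = pooled @ W_enc                        # compact state vector
    recon = state @ W_dec                         # reconstruction
    loss = float(np.mean((recon - pooled) ** 2))  # self-supervised MSE
    return state, loss

tokens = rng.normal(size=(n_tokens, token_dim))   # e.g. 576 visual tokens
state, loss = encode_state(tokens)
print(state.shape)  # (8,)
```

The point of the compact state is that the policy network never has to attend over hundreds of raw visual tokens per decision step, which keeps policy learning cheap relative to the model it is pruning.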

📝 Abstract
Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at https://github.com/MagicVicCoder/TPRL.
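The abstract frames pruning as a sequential decision process: at each step the policy picks how aggressively to prune, and the token set shrinks accordingly. The sketch below illustrates only that MDP structure with random stand-in relevance scores and a hand-picked three-step trajectory; the paper instead learns the keep ratios with imitation learning plus PPO, and `prune_step` and the ratios here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_step(tokens, scores, keep_ratio):
    """One decision step: keep the top keep_ratio fraction of tokens by score."""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(scores)[-k:]           # indices of the k highest scores
    return tokens[idx], scores[idx]

tokens = rng.normal(size=(576, 64))          # initial visual tokens
scores = rng.random(576)                     # stand-in relevance scores

# A three-step pruning trajectory; a learned policy would choose these
# ratios per step, conditioned on the compact state.
for keep_ratio in (0.7, 0.7, 0.68):
    tokens, scores = prune_step(tokens, scores, keep_ratio)

removed = 1 - len(tokens) / 576
print(f"{removed:.1%} of visual tokens removed")
```

With these illustrative ratios the trajectory removes roughly two thirds of the tokens, in the same regime as the 66.7% removal the paper reports; the sequential formulation matters because each step's pruning decision conditions on the tokens the previous steps kept.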
Problem

Research questions and friction points this paper is trying to address.

visual token compression
large vision-language models
inference efficiency
sequential decision process
adaptive optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
token compression
vision-language models
sequential decision making
policy optimization
Sihan Cao
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Jianwei Zhang
Professor, School of Education, University at Albany, SUNY
CSCL, learning sciences, technology for creativity, knowledge building, inquiry-based learning
Pengcheng Zheng
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Jiaxin Yan
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Caiyan Qin
School of Robotics and Advanced Manufacture, Harbin Institute of Technology, Shenzhen, China
Yalan Ye
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Wei Dong
PhD candidate, School of Computer Science and Engineering, Northwestern Polytechnical University
Deep Learning
Peng Wang
Professor, University of Electronic Science and Technology of China
computer vision, deep learning, machine learning
Yang Yang
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Chaoning Zhang
Professor at UESTC (University of Electronic Science and Technology of China)
Computer Vision, LLM and VLM, GenAI and AIGC Detection