AutoPrune: Each Complexity Deserves a Pruning Policy

📅 2025-09-28
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing vision-language model pruning methods rely on fixed, layer-wise heuristics, failing to adapt to input- and task-specific complexity variations—leading to misaligned visual token removal and suboptimal inference trajectories. This paper proposes a training-free, adaptive pruning framework: it models task complexity via mutual information between vision and text tokens, and introduces the first complexity-aware, personalized pruning curve; combined with logistic retention curve projection under budget constraints, it enables dynamic token retention decisions. The method is model-agnostic, compatible with diverse vision-language and embodied AI architectures. Evaluated on LLaVA-1.5-7B, it prunes 89% of visual tokens, reduces FLOPs by 76.8%, and achieves 96.7% of the original model’s average accuracy—outperforming PDrop by 9.1%—demonstrating superior efficiency–accuracy trade-offs.
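For scale, a rough back-of-the-envelope check (assuming the standard LLaVA-1.5 configuration of 576 visual tokens per image): pruning 89% of them leaves about 576 × (1 − 0.89) ≈ 63 visual tokens, on average, to be carried through the decoder layers.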

📝 Abstract
The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal onto a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of a given task and guarantees adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating its effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
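To make the budget-constrained logistic retention curve concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the way the complexity score shifts the curve's midpoint, and the budget normalization are all illustrative assumptions; only the general idea (a logistic-shaped per-layer retention schedule whose shape depends on a complexity score and whose average retention is tied to a compute budget) comes from the abstract.

```python
import numpy as np

def retention_curve(num_layers, complexity, budget, steepness=8.0):
    """Illustrative budget-constrained logistic retention curve (not the paper's exact formulation).

    complexity in [0, 1]: higher values push pruning toward later layers
    (broad exploration first, narrowing focus later).
    budget in (0, 1]: target average fraction of visual tokens kept per layer.
    Returns per-layer retention ratios in [0, 1].
    """
    layers = np.arange(num_layers)
    # Logistic decay over depth; the drop-off midpoint shifts later for more complex inputs.
    midpoint = complexity * (num_layers - 1)
    curve = 1.0 / (1.0 + np.exp(steepness * (layers - midpoint) / num_layers))
    # Rescale so the mean retention matches the compute budget, then clip to valid ratios.
    return np.clip(curve * budget / curve.mean(), 0.0, 1.0)

def tokens_to_keep(curve, num_visual_tokens):
    """Turn per-layer retention ratios into integer token counts (at least one token per layer)."""
    return np.maximum(1, np.round(curve * num_visual_tokens)).astype(int)
```

For example, `tokens_to_keep(retention_curve(32, complexity=0.7, budget=0.11), 576)` would keep visual tokens longer in the early decoder layers of a 32-layer model before dropping sharply, while targeting roughly 11% average retention, i.e. in the ballpark of the ~89% pruning ratio reported for LLaVA-1.5-7B. The exact shape parameters here are placeholders.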
Problem

Research questions and friction points this paper is trying to address.

Fixed, layer-wise pruning heuristics fail to adapt to input- and task-specific complexity in vision-language models
Uniform pruning schedules misalign token elimination with the model's holistic reasoning trajectory
The substantial computational cost of redundant visual tokens must be reduced without sacrificing accuracy across tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complexity-Adaptive Pruning tailors policies to input complexities
Quantifies mutual information between visual and textual tokens
Projects this signal onto budget-constrained logistic retention curves (see the sketch after this list)
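As a rough illustration of the first point, the sketch below derives a complexity score from a text-to-vision attention map by treating it as a joint distribution and computing its mutual information. This is an assumed proxy, not the paper's estimator; the paper only states that complexity is quantified via mutual information between visual and textual tokens.

```python
import numpy as np

def complexity_from_attention(attn, eps=1e-12):
    """Assumed proxy: treat a (text_tokens x visual_tokens) attention matrix as a
    joint distribution p(t, v) and compute its mutual information. Queries whose
    evidence is spread over many distinct visual regions yield higher scores."""
    joint = attn / (attn.sum() + eps)              # joint p(t, v)
    p_t = joint.sum(axis=1, keepdims=True)         # marginal p(t)
    p_v = joint.sum(axis=0, keepdims=True)         # marginal p(v)
    mi = np.sum(joint * np.log((joint + eps) / (p_t @ p_v + eps)))
    return float(mi)
```

In practice such a score would likely be normalized (for instance by the entropy of the marginals) before being mapped to the complexity parameter of the retention curve; that mapping is again an assumption here, not taken from the paper.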
Authors

Hanshi Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University

Yuhao Xu
AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University

Zekun Xu
Amazon
Machine Learning, Statistical Model

Jin Gao
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information

Yufan Liu
Institute of Automation, Chinese Academy of Sciences
Image/video processing, Knowledge Distillation, Saliency detection, Model compression, Video coding

Weiming Hu
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information; School of Information Science and Technology, ShanghaiTech University

Ke Wang
KargoBot

Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Computer Vision, Object Tracking and Segmentation