LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing low-rank compression methods for vision-language models optimize only local reconstruction error, rely on heuristic or uniform rank allocation, and largely neglect feed-forward networks, so they struggle to balance efficiency and accuracy under low-precision inference. This work proposes LASER, a framework that derives a curvature-weighted singular value decomposition (SVD) objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide the low-rank factorization. LASER further introduces a cross-layer, loss-aware rank allocation mechanism driven by calibration gradients, and extends compression to feed-forward networks through a hybrid scheme that combines SVD with quantization. Experiments show that LASER achieves more than 2.3× decoding speedup over prior methods while preserving strong accuracy under low-precision inference.
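To make the idea of a curvature-weighted SVD concrete, here is a minimal NumPy sketch of a generic Kronecker-factored weighted low-rank approximation. It is not the paper's implementation. It assumes the Fisher information of a layer with weight W (shape out × in) is approximated as A ⊗ G, where A is the input second-moment matrix and G is the output-gradient second-moment matrix, so that the second-order loss change for a perturbation ΔW is ½·tr(ΔWᵀ G ΔW A). The function names, the damping term `eps`, and the synthetic statistics are all hypothetical.

```python
import numpy as np


def sym_sqrt_and_inv(M, eps=1e-6):
    """Return (M + damping)^{1/2} and its inverse for a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None) + eps  # damping keeps the inverse finite
    root = (vecs * np.sqrt(vals)) @ vecs.T
    inv_root = (vecs / np.sqrt(vals)) @ vecs.T
    return root, inv_root


def curvature_weighted_lowrank(W, A, G, rank, eps=1e-6):
    """Rank-`rank` factors (left, right) that minimise the weighted error
    0.5 * tr(D^T G D A) with D = W - left @ right (up to damping).

    Assumption: A (in x in) and G (out x out) are Kronecker factors of a
    Fisher approximation estimated from calibration data.
    """
    G_half, G_inv_half = sym_sqrt_and_inv(G, eps)
    A_half, A_inv_half = sym_sqrt_and_inv(A, eps)
    # Truncated SVD in the "whitened" space, then map the factors back.
    U, s, Vt = np.linalg.svd(G_half @ W @ A_half, full_matrices=False)
    left = G_inv_half @ (U[:, :rank] * s[:rank])  # shape (out, rank)
    right = Vt[:rank] @ A_inv_half                # shape (rank, in)
    return left, right


def weighted_error(W, W_hat, A, G):
    """Second-order loss proxy 0.5 * tr(D^T G D A) for D = W - W_hat."""
    D = W - W_hat
    return 0.5 * np.trace(D.T @ G @ D @ A)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic, anisotropic calibration statistics (illustrative only).
    X = rng.normal(size=(256, 8)) * np.linspace(0.1, 3.0, 8)
    dY = rng.normal(size=(256, 6)) * np.linspace(0.1, 3.0, 6)
    A, G = X.T @ X / 256, dY.T @ dY / 256
    W = rng.normal(size=(6, 8))

    left, right = curvature_weighted_lowrank(W, A, G, rank=2)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    plain = (U[:, :2] * s[:2]) @ Vt[:2]
    print("weighted SVD proxy loss:", weighted_error(W, left @ right, A, G))
    print("plain SVD proxy loss:   ", weighted_error(W, plain, A, G))
```

Under these assumptions, the plain truncated SVD minimises the unweighted Frobenius error, while the weighted variant trades reconstruction accuracy in low-curvature directions for accuracy in the directions the loss proxy is most sensitive to.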

📝 Abstract

Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose LASER (Loss-Aware Singular-value dEcomposition and Rank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than 2.3× decoding speedup over previous work while preserving strong accuracy under low-precision inference.
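The abstract does not spell out the rank allocation rule, so the following is only one plausible sketch of a gradient-driven, cross-layer allocation. It assumes each singular component of a layer is scored by a first-order estimate of the loss change from dropping it, |σₖ · uₖᵀ g vₖ|, where g is a calibration gradient of the loss with respect to that layer's weight. Components are then kept greedily across all layers by score per parameter until a global budget is used up. The function name `allocate_ranks`, the scoring rule, and the cost model (m + n parameters per rank-1 component of an m × n layer) are assumptions for illustration.

```python
import numpy as np


def allocate_ranks(weights, grads, param_budget):
    """Greedy cross-layer rank allocation under a global parameter budget.

    weights: list of 2-D weight matrices, one per layer.
    grads:   list of calibration gradients with matching shapes.
    Returns a list with the number of singular components kept per layer.
    """
    candidates = []  # (score per parameter, layer index, parameter cost)
    for idx, (W, g) in enumerate(zip(weights, grads)):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        # u_k^T g v_k for every component k in one contraction.
        proj = np.einsum("ik,ij,kj->k", U, g, Vt)
        scores = np.abs(s * proj)          # first-order loss-change estimate
        cost = W.shape[0] + W.shape[1]     # parameters of one rank-1 factor
        candidates.extend((score / cost, idx, cost) for score in scores)

    ranks = [0] * len(weights)
    used = 0
    # Most loss-sensitive components per parameter are kept first.
    for _, idx, cost in sorted(candidates, key=lambda c: c[0], reverse=True):
        if used + cost <= param_budget:
            ranks[idx] += 1
            used += cost
    return ranks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(6, 8)), rng.normal(size=(6, 8))]
    # Layer 1 gets a zero gradient, i.e. the loss proxy is insensitive to it.
    grads = [rng.normal(size=(6, 8)), np.zeros((6, 8))]
    print(allocate_ranks(weights, grads, param_budget=42))
```

In this toy setup the budget covers three rank-1 components, and all three go to the layer with non-zero calibration gradient. A uniform allocation would instead split the budget evenly regardless of sensitivity.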

Problem

Research questions and friction points this paper is trying to address.

vision-language models

low-rank decomposition

rank allocation

model compression

low-precision inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Loss-aware SVD

Rank allocation

Low-rank compression