🤖 AI Summary
This work addresses the excessive GPU memory consumption in GUI visual language models during multi-turn interactions, caused by the linear growth of KV cache, which hinders practical deployment. The authors propose a training-free KV cache compression framework that discards conventional assumptions of shared saliency maps and fixed truncation, instead introducing the first spatiotemporally adaptive importance reweighting mechanism at the attention subspace granularity. The method dynamically calibrates token importance by integrating spatial mutual information–driven subspace-aware scoring, a temporal stability discounting scheme, and entropy-guided adaptive temperature scaling. Evaluated on four GUI benchmarks, it achieves state-of-the-art average accuracy while reducing peak GPU memory usage by nearly 40% under a 20% KV cache budget, with negligible FLOPs overhead during compression.
📝 Abstract
Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.