AVIS: Adaptive Test-Time Scaling for Vision-Language Models

📅 2026-06-09
🤖 AI Summary
This work addresses the high inference cost of vision-language models, which stems from large visual contexts and long reasoning chains, and from the fact that existing methods typically optimize only one of these at a time. The paper proposes Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both Visual Context Scaling (VCS) and Visual Reasoning Scaling (VRS) per query. VCS is realized through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule that removes redundant visual tokens before prefilling. VRS is realized through adaptive self-consistency, in which a learned difficulty predictor selects the number of reasoning rollouts. The approach is compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and it remains effective on top of VLMs post-trained with reinforcement learning.
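To make the idea of a linear-time, key-based pruning rule concrete, here is a minimal sketch. It is not the paper's KDV rule, which this page does not specify. It assumes, purely for illustration, that a visual token is redundant when its attention key is highly similar (by cosine similarity) to the mean key. The function name `prune_by_key_similarity` and the parameter `max_similarity` are hypothetical.

```python
import math


def _normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prune_by_key_similarity(keys, max_similarity):
    """Return indices of tokens to keep, in their original order.

    ASSUMPTION (illustrative, not the paper's KDV rule): a token is
    treated as redundant when the cosine similarity between its key
    and the mean key is at least `max_similarity`.
    The rule makes a constant number of passes over the N keys,
    so it runs in O(N) for a fixed key dimension.
    """
    n = len(keys)
    if n == 0:
        return []
    dim = len(keys[0])
    unit_keys = [_normalize(k) for k in keys]
    # Mean of the unit keys, itself normalized to unit length.
    mean_key = _normalize(
        [sum(u[d] for u in unit_keys) / n for d in range(dim)]
    )
    kept = []
    for i, u in enumerate(unit_keys):
        similarity = sum(a * b for a, b in zip(u, mean_key))
        if similarity < max_similarity:
            kept.append(i)
    return kept
```

With three identical keys and one distinct key, a threshold of 0.9 keeps only the distinct token, because the repeated keys dominate the mean and therefore score as redundant. A real system would likely also bound how many tokens can be dropped; that is left out here to keep the sketch short.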
📝 Abstract
Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy-compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
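The adaptive self-consistency idea described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the difficulty score is assumed to lie in [0, 1], the linear mapping from difficulty to rollout count is a guess, and the names `choose_num_rollouts`, `adaptive_self_consistency`, and `sample_answer` are hypothetical. In a shared-prefill setup, each call to `sample_answer` would decode from the same cached prefill rather than re-encoding the visual context.

```python
from collections import Counter


def choose_num_rollouts(difficulty, min_rollouts=1, max_rollouts=8):
    """Map a difficulty score to a rollout budget.

    ASSUMPTION: difficulty is in [0, 1] and the budget grows linearly
    with it. The paper's actual mapping is not given on this page.
    """
    clamped = min(1.0, max(0.0, difficulty))
    return min_rollouts + round(clamped * (max_rollouts - min_rollouts))


def adaptive_self_consistency(sample_answer, difficulty,
                              min_rollouts=1, max_rollouts=8):
    """Sample k answers, with k chosen from difficulty, and majority-vote.

    `sample_answer(i)` is a hypothetical callable returning the final
    answer of the i-th reasoning rollout.
    Returns (winning_answer, number_of_rollouts_used).
    """
    k = choose_num_rollouts(difficulty, min_rollouts, max_rollouts)
    answers = [sample_answer(i) for i in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, k
```

An easy query (difficulty near 0) uses a single rollout and skips voting altogether, while a hard query (difficulty near 1) spends the full budget. This per-query budget is what distinguishes the adaptive variant from fixed-k self-consistency.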
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Test-Time Scaling
Visual Context Scaling
Visual Reasoning Scaling
Inference Cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Inference
Vision-Language Models
Test-Time Scaling
Visual Token Pruning
Self-Consistency