ProxyThinker: Test-Time Guidance through Small Visual Reasoners

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large vision-language models (LVLMs) struggle to acquire fine-grained visual reasoning capabilities cost-effectively. Method: This paper proposes a training-free, decoding-time enhancement that couples a lightweight visual reasoner to the LVLM during inference. By differencing output distributions and dynamically reweighting them, it steers the LVLM toward self-verifying, self-correcting "slow thinking" without parameter updates or reinforcement fine-tuning. Contribution/Results: To our knowledge, this is the first approach enabling training-free capability transfer from small to large models. It achieves strong performance on spatial reasoning, math-oriented visual question answering, and multi-domain benchmarks, letting untuned base models compete with their full-scale reinforcement fine-tuned (RFT) counterparts, while an implementation based on parallelism techniques runs up to 38× faster than prior decoding-time methods. The method coordinates multiple models at inference time without architectural modification.

📝 Abstract
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits the slow-thinking reasoning demonstrated by the emerged sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
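The core decoding rule described in the abstract (shifting the large model's next-token logits by the difference between a small RFT reasoner and its untuned base) can be sketched per decoding step as follows. This is a minimal illustration of the logit-arithmetic idea, not the paper's implementation; the function name, the guidance weight `alpha`, and the toy logit values are all assumptions for demonstration.

```python
import numpy as np

def proxy_guided_logits(target, expert, base, alpha=1.0):
    """Guide a large target model's next-token logits by the
    expert-minus-base difference of a small model pair, i.e.
    target + alpha * (expert - base). `alpha` is a hypothetical
    guidance weight; the paper's summary does not name one."""
    return np.asarray(target) + alpha * (np.asarray(expert) - np.asarray(base))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

# Toy 4-token vocabulary: the small RFT expert prefers token 2 far more
# than its untuned base does, so guidance boosts token 2 for the target.
target = np.array([2.0, 1.0, 1.0, 0.5])  # large base LVLM logits
expert = np.array([0.5, 0.5, 3.0, 0.5])  # small RFT reasoner logits
base   = np.array([0.5, 0.5, 1.0, 0.5])  # small untuned base logits

guided = proxy_guided_logits(target, expert, base)
# Unguided decoding would pick token 0; the guided distribution
# now favors token 2, the expert's preferred continuation.
```

In a real decoder this adjustment would be applied to the logits at every generation step before sampling, which is why the abstract frames it as modifying the decoding dynamics rather than the model weights.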
Problem

Research questions and friction points this paper is trying to address.

Enables large models to inherit reasoning from small visual reasoners without training
Improves performance on spatial, mathematical, and multi-disciplinary visual reasoning
Achieves faster inference compared to previous decoding-time methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time technique without training
Modifies decoding dynamics for reasoning
Parallelism enables faster inference
Zilin Xiao (Rice University)
Jaywon Koo (Rice University)
Siru Ouyang (University of Illinois Urbana-Champaign)
Jefferson Hernandez (Rice University)
Yu Meng (University of Virginia)
Vicente Ordonez (Rice University)