Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the “reasoning tax” phenomenon in vision-language models, where enhanced perceptual capabilities during instruction tuning often come at the expense of reasoning performance. To mitigate this trade-off, the authors propose Input-Adaptive, Modality-Aware Depth Aggregation (IADA), a lightweight cross-layer fusion mechanism that dynamically integrates representations from varying network depths via a low-rank bottleneck structure. This approach jointly preserves both perceptual and reasoning abilities during fine-tuning and is compatible with parameter-efficient strategies such as LoRA, introducing only 0.14 million additional parameters. Experiments on Qwen3-VL-2B demonstrate that IADA improves average reasoning scores by 9.5 points and perceptual scores by 3.3 points over standard LoRA fine-tuning.
📝 Abstract
Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
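The mechanism described above (input-adaptive weights over representations from different depths, combined through a low-rank bottleneck) can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the class name `IADASketch`, the softmax gating, and the residual combination are assumptions; the paper's modality-aware component (e.g., separate treatment of vision and text tokens) is omitted for brevity.

```python
# Hypothetical sketch of Input-Adaptive Depth Aggregation (IADA):
# per-input weights over cached per-layer hidden states, with the
# aggregated mixture passed through a low-rank bottleneck.
import torch
import torch.nn as nn


class IADASketch(nn.Module):
    def __init__(self, hidden_dim: int, num_layers: int, rank: int = 8):
        super().__init__()
        # Gate: maps the current hidden state to one weight per depth,
        # making the aggregation input-adaptive.
        self.gate = nn.Linear(hidden_dim, num_layers)
        # Low-rank bottleneck: down-project then up-project the mixture,
        # keeping the added parameter count small.
        self.down = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden_dim, bias=False)

    def forward(self, layer_states: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq, hidden)
        # query:        (batch, seq, hidden) -- the current representation
        weights = torch.softmax(self.gate(query), dim=-1)  # (batch, seq, num_layers)
        # Input-adaptive weighted mixture across depths.
        mixed = torch.einsum("bsl,lbsh->bsh", weights, layer_states)
        # Residual update through the low-rank bottleneck.
        return query + self.up(self.down(mixed))
```

With a small `rank`, the bottleneck keeps the parameter overhead low, which is consistent with the paper's emphasis on compatibility with parameter-efficient strategies such as LoRA.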
Problem

Research questions and friction points this paper is trying to address.

reasoning tax
vision-language models
supervised fine-tuning
reasoning degradation
perception-reasoning trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Input-Adaptive Depth Aggregation
reasoning tax
vision-language models
parameter-efficient fine-tuning
cross-depth representation