Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of multimodal large language models (MLLMs) on visual reasoning tasks under chain-of-thought (CoT) prompting, which the authors attribute to two failure modes: premature answer commitment and limited access to visual tokens during rationale generation. They propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment and sustain attention on visual tokens, without modifying the model architecture. Att-CoT plugs into existing CoT-based supervised fine-tuning (CoT-SFT) pipelines. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT improves CoT performance over standard fine-tuning.

📝 Abstract

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.
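
The abstract does not give the Att-CoT objective itself, so the following is only a minimal sketch of what an attention-guided auxiliary term could look like, not the authors' formulation. It assumes that per-step attention weights over the input tokens are available for each rationale step and that a boolean mask marks which tokens are visual. The function names, the `target_mass` threshold, and the `weight` coefficient are all hypothetical. The sketch covers only the "sustained visual-token access" idea; it does not model delayed answer commitment.

```python
# Illustrative sketch only: a hinge-style penalty on low visual attention,
# added to an ordinary fine-tuning loss. Not the paper's actual objective.

def visual_attention_mass(attn_rows, visual_mask):
    """For each rationale step, sum the attention placed on visual tokens.

    attn_rows:   list of per-step attention distributions (each sums to 1).
    visual_mask: list of booleans, True where the token is a visual token.
    """
    return [
        sum(w for w, is_visual in zip(row, visual_mask) if is_visual)
        for row in attn_rows
    ]


def attention_guidance_penalty(attn_rows, visual_mask, target_mass=0.3):
    """Mean shortfall of visual attention mass below a target (hypothetical)."""
    masses = visual_attention_mass(attn_rows, visual_mask)
    shortfalls = [max(0.0, target_mass - m) for m in masses]
    return sum(shortfalls) / len(shortfalls)


def combined_loss(ce_loss, attn_rows, visual_mask, weight=0.5, target_mass=0.3):
    """Standard fine-tuning loss plus the weighted attention penalty."""
    penalty = attention_guidance_penalty(attn_rows, visual_mask, target_mass)
    return ce_loss + weight * penalty


# Toy example: 4 input tokens, the first two are visual; 2 rationale steps.
visual_mask = [True, True, False, False]
attn_rows = [
    [0.40, 0.20, 0.30, 0.10],  # step 1 attends mostly to visual tokens
    [0.05, 0.05, 0.60, 0.30],  # step 2 drifts toward text tokens
]
penalty = attention_guidance_penalty(attn_rows, visual_mask)
total = combined_loss(2.0, attn_rows, visual_mask)
```

In this toy case only the second step falls below the assumed target, so it alone contributes to the penalty. In an actual training run the same quantity would be computed on differentiable attention tensors so that the penalty can influence the gradients.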

Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought

Multimodal Large Language Models

visual reasoning

attention mechanism

fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-guided fine-tuning

Chain-of-Thought reasoning

Multimodal Large Language Models