The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether performance gains in current clinical vision-language models (VLMs) stem from genuine multimodal reasoning or from superficial cues in textual prompts. Evaluating twelve open-source VLMs on neuroimaging datasets that lack reliable individual-level diagnostic signals (FOR2107 and OASIS-3), the authors identify a phenomenon they term the "scaffold effect": merely mentioning "MRI" in the prompt significantly boosts model performance irrespective of the actual image content. Through confidence analysis, expert human evaluation, and preference alignment, they find that smaller models exhibit F1 score improvements of up to 58%, with 70–80% of this gain attributable to prompt phrasing alone. All models generated hallucinated image-based justifications, and after preference alignment their performance regressed to random baseline levels, revealing that existing evaluation protocols fail to capture true multimodal diagnostic reasoning capabilities.
📝 Abstract
Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely *mentioning* MRI availability in the task prompt accounts for 70–80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the *scaffold effect*. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.
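The abstract quantifies the scaffold effect as an F1 gap between two prompt conditions that differ only in whether MRI availability is mentioned. A minimal sketch of that comparison, using hypothetical toy predictions (not the paper's data or code), might look like:

```python
# Illustrative sketch: the "scaffold gap" as the F1 difference between two
# prompt framings, where only the textual mention of MRI changes and the
# (uninformative) image input stays the same. All data below is made up.

def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class, computed from scratch."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and model predictions under the two framings.
y_true           = [1, 0, 1, 1, 0, 0, 1, 0]
pred_text_only   = [0, 0, 1, 0, 1, 0, 0, 0]  # prompt omits any MRI mention
pred_mri_mention = [1, 0, 1, 1, 1, 0, 1, 0]  # prompt mentions MRI availability

gap = f1_score(y_true, pred_mri_mention) - f1_score(y_true, pred_text_only)
print(f"scaffold gap (dF1): {gap:.2f}")  # prints "scaffold gap (dF1): 0.56"
```

On real data, the paper's contrastive confidence analysis attributes 70–80% of the observed gain to this prompt-phrasing shift alone; the sketch only shows the shape of the measurement.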
Problem

Research questions and friction points this paper is trying to address.

scaffold effect
vision-language models
clinical AI
modality collapse
multimodal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

scaffold effect
vision-language models
clinical AI evaluation
modality collapse
prompt framing
Doan Nam Long Vu
Natural Language Processing for Expert Domains (ExpNLP), Technical University of Darmstadt
Simone Balloccu
TU Darmstadt
NLP · AI · Machine Learning · NLG