🤖 AI Summary
This work addresses a key limitation of existing visual text comprehension (VTC) approaches: they render text into fixed-layout images without accounting for how vision-language models (VLMs) internally process visualized text, and so fail to make effective use of key information. The study identifies, for the first time, a “localized but underutilized” phenomenon in which VLMs locate evidential regions through mid-to-late layer attention yet do not fully exploit them. Building on this insight, the authors propose an adaptive, training-free, and model-agnostic rendering method that analyzes intermediate-layer attention maps, maps salient visual patches to their corresponding text spans, and dynamically re-renders the input to enlarge those critical regions. This plug-and-play module consistently improves performance across nine VTC benchmarks and four VLM backbones, remains robust under input degradation, and yields further gains when combined with post-training strategies.
📝 Abstract
Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
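The select-map-enlarge loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the row-major square patch grid, the box-overlap rule for mapping patches to words, and the 1.5x enlargement factor are all assumptions made for illustration. The attention scores are taken as given (one value per patch), since the abstract does not specify how layers and heads are aggregated, and the sketch stops at a per-word font-size plan rather than drawing the page.

```python
from typing import List, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


def top_k_patches(attn: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring patches (ties go to the lower index)."""
    order = sorted(range(len(attn)), key=lambda i: (-attn[i], i))
    return order[:k]


def patch_box(idx: int, grid_w: int, patch: float) -> Box:
    """Pixel box of patch `idx`, assuming a row-major grid of square patches."""
    row, col = divmod(idx, grid_w)
    return (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def salient_words(attn: Sequence[float], word_boxes: Sequence[Box],
                  grid_w: int, patch: float, k: int) -> Set[int]:
    """Map the top-k patches back to the words whose boxes they overlap."""
    hot = [patch_box(i, grid_w, patch) for i in top_k_patches(attn, k)]
    return {w for w, box in enumerate(word_boxes)
            if any(overlaps(box, h) for h in hot)}


def font_sizes(n_words: int, salient: Set[int],
               base: float = 12.0, scale: float = 1.5) -> List[float]:
    """Per-word font sizes for re-rendering; salient words are enlarged.

    The base size and scale factor are illustrative values, not from the paper.
    """
    return [base * scale if w in salient else base for w in range(n_words)]


# Toy page: a 2x3 grid of 10-pixel patches and three word boxes.
attn = [0.05, 0.10, 0.05, 0.10, 0.60, 0.10]
word_boxes = [(0, 0, 9, 9), (12, 12, 18, 18), (21, 12, 29, 18)]

salient = salient_words(attn, word_boxes, grid_w=3, patch=10, k=1)
plan = font_sizes(len(word_boxes), salient)
print(salient, plan)
```

In a full pipeline, the resulting size plan would feed a text renderer that lays out the page again, and the VLM would then be queried a second time on the re-rendered image.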