🤖 AI Summary
This work systematically investigates, for the first time, how vision-text compression (VTC) affects the long-context understanding capabilities of vision-language models (VLMs). To this end, we introduce VTCBench, the first long-context benchmark designed specifically for VTC evaluation, covering retrieval, reasoning, and memory tasks, along with its real-world extension, VTCBench-Wild. We propose a multi-dimensional evaluation framework that integrates OCR encoding, 2D dense-representation compression, cross-modal attention analysis, and dialogue memory tracking, enabling fine-grained, unified assessment of both open- and closed-source VLMs. Experimental results reveal that while state-of-the-art VLMs accurately decode text from VTC-rendered images, their long-range factual association and implicit reasoning abilities degrade substantially under VTC, with an average performance drop of 42.7% across the three task categories. This exposes a critical semantic fidelity gap in current VTC methods, directly undermining VLMs' contextual comprehension.
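To make the compression idea concrete, here is a minimal sketch of what vision-text compression does: render a long passage as a dense 2D image so a VLM consumes vision tokens instead of text tokens. The page size, font, patch size, and the ~4-characters-per-token estimate below are illustrative assumptions, not the actual settings used by DeepSeek-OCR, Glyph, or VTCBench.

```python
# Sketch: render long text to an image and estimate the token compression ratio.
# All numeric parameters are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(text: str, width: int = 1024, height: int = 1024,
                         font_size: int = 14, margin: int = 16) -> Image.Image:
    """Render text onto a single page-like image (assumed layout)."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()          # swap in a TTF font for realistic density
    line_height = font_size + 2
    chars_per_line = (width - 2 * margin) // (font_size // 2)
    y = margin
    for line in textwrap.wrap(text, width=chars_per_line):
        if y + line_height > height - margin:
            break                            # a real pipeline would paginate instead of truncating
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height
    return img

def estimated_compression_ratio(text: str, patch: int = 16,
                                width: int = 1024, height: int = 1024) -> float:
    """Rough ratio of text tokens to vision tokens (~4 chars per text token assumed)."""
    text_tokens = max(1, len(text) // 4)
    vision_tokens = (width // patch) * (height // patch)
    return text_tokens / vision_tokens

if __name__ == "__main__":
    passage = "Long-context document text for the VLM to read. " * 2000
    render_text_to_image(passage).save("vtc_page.png")
    print(f"approx. compression ratio: {estimated_compression_ratio(passage):.1f}x")
```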
📝 Abstract
The computational and memory overheads associated with expanding the context window of large language models (LLMs) severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks such as DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations in order to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering over long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite decoding textual information (e.g., via OCR) well, most VLMs exhibit surprisingly poor long-context understanding over VTC-compressed information, failing to capture long-range associations or dependencies in the context. This study provides a deeper understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
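The abstract above groups evaluation into three settings (VTC-Retrieval, VTC-Reasoning, VTC-Memory). The skeleton below shows one plausible way such an evaluation loop could be organized; `query_vlm` is a hypothetical placeholder for whichever model API is under test, and the exact-match metric, prompts, and data format are assumptions rather than the actual VTCBench protocol.

```python
# Hypothetical evaluation skeleton over VTC-compressed contexts (not the official harness).
from dataclasses import dataclass
from PIL import Image

@dataclass
class VTCExample:
    task: str                    # "retrieval", "reasoning", or "memory"
    context_image: Image.Image   # long context rendered as a dense 2D image
    question: str
    answer: str

def query_vlm(image: Image.Image, question: str) -> str:
    """Placeholder for a real VLM call (open-source or proprietary)."""
    return ""  # replace with an actual model invocation

def evaluate(examples: list[VTCExample]) -> dict[str, float]:
    """Per-task exact-match accuracy (assumed metric) on VTC-compressed inputs."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for ex in examples:
        pred = query_vlm(ex.context_image, ex.question)
        total[ex.task] = total.get(ex.task, 0) + 1
        correct[ex.task] = correct.get(ex.task, 0) + int(pred.strip() == ex.answer.strip())
    return {task: correct.get(task, 0) / n for task, n in total.items()}
```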