Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This study investigates whether DeepSeek-OCR relies on genuine visual capability or on language priors under high-ratio vision-text compression, and evaluates its reliability in long-context scenarios. Using sentence- and word-level semantic perturbations to disentangle linguistic priors, combined with semantic corruption tests, context stress evaluations, and vision-language decoupling analyses, the work provides the first empirical evidence of how severely end-to-end OCR models depend on language priors: accuracy plummets from roughly 90% to 20% without linguistic support, hallucinations increase as visual tokens decrease, and the model collapses entirely at around 10,000 text tokens. Traditional OCR pipelines, by contrast, prove far more robust. The study also establishes a comprehensive multi-model robustness benchmark covering 13 baseline approaches.

📝 Abstract
DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse at around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimization of the vision-text compression paradigm. We release all data, results, and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
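The core idea behind the semantic-corruption methodology can be sketched as follows: perturb the text so that it remains visually word-like but loses its linguistic coherence, so any remaining recognition accuracy must come from visual features rather than language priors. The function names and shuffling strategy below are illustrative assumptions for this sketch, not the authors' released scripts (those are at the GitHub repository linked above):

```python
import random


def word_level_corruption(text: str, swap_ratio: float = 0.5, seed: int = 0) -> str:
    """Shuffle a fraction of the words in place, so the rendered page keeps
    the same vocabulary and layout statistics but loses local semantics.
    Illustrative only; parameters are assumptions, not the paper's settings."""
    rng = random.Random(seed)
    words = text.split()
    n_swap = max(2, int(len(words) * swap_ratio))
    idx = rng.sample(range(len(words)), min(n_swap, len(words)))
    picked = [words[i] for i in idx]
    rng.shuffle(picked)
    for i, w in zip(idx, picked):
        words[i] = w
    return " ".join(words)


def sentence_level_corruption(sentences: list[str], seed: int = 0) -> list[str]:
    """Shuffle sentence order to break cross-sentence language priors
    while leaving each sentence's glyphs untouched."""
    rng = random.Random(seed)
    out = sentences[:]
    rng.shuffle(out)
    return out
```

Because both perturbations preserve the exact multiset of words and glyphs, a model with genuine character-level visual capability should transcribe corrupted and clean pages equally well; a gap between the two conditions measures reliance on language priors.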
Problem

Research questions and friction points this paper is trying to address.

OCR robustness
vision-text compression
language priors
semantic corruption
long-context bottleneck
Yunhao Liang
Chengdu Institute of Computer Applications, CAS
Ruixuan Ying
IMRAM, Tohoku University
Bo Li
China Tower Corporation Limited
Hong Li
China Tower Corporation Limited
Kai Yan
China Tower Corporation Limited
Qingwen Li
Suzhou Institute of Nano-Tech and Nano-Bionics
Min Yang
ByteDance
Okamoto Satoshi
IMRAM, Tohoku University
Zhe Cui
Beijing University of Posts and Telecommunications
Shiwen Ni
Shenzhen Institutes of Advanced Technology, CAS