Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This study investigates whether DeepSeek-OCR relies on genuine visual capability or on language priors under high-ratio vision-text compression, and evaluates its reliability in long-context scenarios. Using sentence- and word-level semantic perturbations to disentangle linguistic priors, combined with semantic corruption tests, context stress evaluations, and vision-language decoupling analyses, the work provides the first empirical evidence of how severely end-to-end OCR models depend on language priors: accuracy plummets from roughly 90% to 20% without linguistic support, hallucinations increase as visual tokens decrease, and the model collapses entirely at around 10,000 text tokens. Traditional OCR pipelines, by contrast, prove far more robust. The study also establishes a comprehensive multi-model robustness benchmark covering 13 baseline approaches.

📝 Abstract
DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse at around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimization of the vision-text compression paradigm. We release all data, results, and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
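The core idea behind the semantic-corruption methodology can be sketched as follows: perturb the text so that it remains visually word-like but loses its linguistic coherence, so any remaining recognition accuracy must come from visual features rather than language priors. The function names and shuffling strategy below are illustrative assumptions for this sketch, not the authors' released scripts (those are at the GitHub repository linked above):

```python
import random


def word_level_corruption(text: str, swap_ratio: float = 0.5, seed: int = 0) -> str:
    """Shuffle a fraction of the words in place, so the rendered page keeps
    the same vocabulary and layout statistics but loses local semantics.
    Illustrative only; parameters are assumptions, not the paper's settings."""
    rng = random.Random(seed)
    words = text.split()
    n_swap = max(2, int(len(words) * swap_ratio))
    idx = rng.sample(range(len(words)), min(n_swap, len(words)))
    picked = [words[i] for i in idx]
    rng.shuffle(picked)
    for i, w in zip(idx, picked):
        words[i] = w
    return " ".join(words)


def sentence_level_corruption(sentences: list[str], seed: int = 0) -> list[str]:
    """Shuffle sentence order to break cross-sentence language priors
    while leaving each sentence's glyphs untouched."""
    rng = random.Random(seed)
    out = sentences[:]
    rng.shuffle(out)
    return out
```

Because both perturbations preserve the exact multiset of words and glyphs, a model with genuine character-level visual capability should transcribe corrupted and clean pages equally well; a gap between the two conditions measures reliance on language priors.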
Problem

Research questions and friction points this paper is trying to address.

OCR robustness
vision-text compression
language priors
semantic corruption
long-context bottleneck
Yunhao Liang
Chengdu Institute of Computer Applications, CAS
Ruixuan Ying
IMRAM, Tohoku University
Bo Li
China Tower Corporation Limited
Hong Li
China Tower Corporation Limited
Kai Yan
China Tower Corporation Limited
Qingwen Li
Suzhou Institute of Nano-Tech and Nano-Bionics
Min Yang
ByteDance
Okamoto Satoshi
IMRAM, Tohoku University
Zhe Cui
Beijing University of Posts and Telecommunications
Shiwen Ni
Shenzhen Institutes of Advanced Technology, CAS