DODO: Discrete OCR Diffusion Models

📅 2026-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost and slow inference of OCR methods that rely on autoregressive decoding, a cost that grows with document length because every generated token requires its own sequential forward pass. The authors introduce DODO, a vision-language model that decodes in parallel using block discrete diffusion, which they describe as the first use of this technique for OCR. Because OCR is highly deterministic (the visual input dictates a unique output sequence), parallel decoding is viable in principle, and decomposing generation into blocks mitigates the synchronization errors of global diffusion. Experiments show near state-of-the-art accuracy with up to 3x faster inference than autoregressive baselines.

📝 Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks such as captioning but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
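
To make the decoding-cost argument concrete, here is a minimal sketch that counts sequential forward passes under the two schemes. It is not DODO's implementation: the function names, the block size, and the fixed number of denoising steps per block are all hypothetical values chosen for illustration, and pass counts are only a rough proxy for wall-clock time.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one sequential forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Block-wise decoding under an assumed schedule.

    Blocks are produced one after another; within a block, all tokens are
    refined in parallel over a fixed number of denoising steps. Both
    block_size and steps_per_block are illustrative assumptions, not
    values reported in the paper.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


if __name__ == "__main__":
    n = 3000  # hypothetical length of a long document, in tokens
    ar = autoregressive_passes(n)
    bd = block_diffusion_passes(n, block_size=32, steps_per_block=8)
    print(f"autoregressive passes:  {ar}")
    print(f"block-diffusion passes: {bd}")
```

The block scheme needs fewer sequential passes only when the steps per block stay below the block size. A per-pass cost that differs between the two schemes would change the realized speedup, which is why this ratio should not be read as a prediction of the reported 3x figure.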

Problem

Research questions and friction points this paper is trying to address.

Optical Character Recognition

diffusion models

autoregressive decoding

parallel decoding

text generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion

block-wise generation

OCR acceleration