Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

📅 2026-06-01

🤖 AI Summary

This work addresses the loss of fine-grained detail, such as blurred strokes and distorted characters, in text rendered by visual autoregressive models, which stems from the limited reconstruction fidelity of visual tokenizers. To overcome this without retraining either the tokenizer or the autoregressive model, the authors propose the Residual Decoder Adapter (RDA), a non-intrusive module that refines the tokenizer's decoder output. RDA uses a paired codebook and a parallel residual branch to learn pixel-level residuals between reconstructed and ground-truth images, preserving the original token space and keeping the adapted tokenizer compatible with existing autoregressive models. Evaluated on the TextAtlas benchmark, RDA substantially improves rendering quality: when applied to fine-tuned Janus-Pro, OCR accuracy improves from 24.52% to 58.26% on TextVisionBlend and from 12.75% to 36.81% on StyledTextSynth.

📝 Abstract

Visual autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer directly is straightforward but expensive, as it requires retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To this end, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer with two components: (i) a paired codebook that shares the token distribution of the original one; and (ii) a parallel branch that learns the small differences (residuals) between the reconstructed image and the ground-truth image in pixel space. This residual design enhances the tokenizer non-invasively while preserving compatibility with existing AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, the OCR accuracy of fine-tuned Janus-Pro rises from 24.52% to 58.26% on TextVisionBlend and from 12.75% to 36.81% on StyledTextSynth. The code is available at https://github.com/CSU-JPG/RDA
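
The residual idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the one-pixel-per-token "decoder", and the single scalar `weight` standing in for the trainable branch are all assumptions made for illustration. It only shows the structure the abstract describes: the token IDs are left untouched, a second (paired) codebook is indexed by the same IDs, and a parallel branch adds a small correction to the frozen decoder's output.

```python
# Toy sketch of residual decoding. All names, shapes, and the scalar
# "weight" are hypothetical; a real adapter would use neural networks.
from typing import List

Vector = List[float]


def lookup(codebook: List[Vector], token_ids: List[int]) -> List[Vector]:
    """Map token IDs to embeddings; the IDs themselves are never modified."""
    return [codebook[t] for t in token_ids]


def frozen_decode(embeddings: List[Vector]) -> Vector:
    """Stand-in for the frozen original decoder: one 'pixel' per token."""
    return [sum(e) / len(e) for e in embeddings]


def residual_branch(embeddings: List[Vector], weight: float) -> Vector:
    """Stand-in for the trainable parallel branch: a per-pixel correction."""
    return [weight * sum(e) for e in embeddings]


def refined_decode(token_ids: List[int], base_codebook: List[Vector],
                   paired_codebook: List[Vector], weight: float) -> Vector:
    """Frozen reconstruction plus the residual predicted from the paired codebook."""
    base = frozen_decode(lookup(base_codebook, token_ids))
    delta = residual_branch(lookup(paired_codebook, token_ids), weight)
    return [b + d for b, d in zip(base, delta)]


def residual_target(ground_truth: Vector, reconstruction: Vector) -> Vector:
    """What the branch would be trained to predict: ground truth minus reconstruction."""
    return [g - r for g, r in zip(ground_truth, reconstruction)]


base_codebook = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
paired_codebook = [[0.25, 0.25], [0.5, 0.0], [0.0, 0.0]]
tokens = [0, 1, 2]  # the same IDs index both codebooks

baseline = refined_decode(tokens, base_codebook, paired_codebook, weight=0.0)
refined = refined_decode(tokens, base_codebook, paired_codebook, weight=0.5)
```

With `weight=0.0` the output equals the frozen decoder's reconstruction, which is why a design of this shape can be attached to an existing tokenizer without disturbing the AR model that produces the token IDs.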

Problem

Research questions and friction points this paper is trying to address.

text rendering

visual tokenizer

autoregressive models

fine-grained detail

image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual Decoder Adapter

visual tokenizer adaptation

autoregressive text rendering