QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of insufficient training data in OCR-free visual document understanding (VDU), which hinders visual encoders from accurately localizing the text regions relevant to a given query, this paper proposes a non-intrusive, decoupled dual-module framework. The query-aware module generates a unique query vector to guide visual attention, while the query-agnostic module independently models token-level spatial relationships; both operate outside the visual attention blocks, enabling collaborative optimization of query embeddings and visual representations. The authors adopt a ViT as the visual encoder, add a lightweight query-embedding injection mechanism, and design a fine-tuning strategy tailored to low-resource settings. Evaluated on multiple text-dense document datasets, the method achieves significant performance gains, notably outperforming state-of-the-art methods with less than 1% of the labeled data, while maintaining high inference efficiency and seamless compatibility with mainstream vision-language model architectures.
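To make the dual-module design concrete, here is a minimal PyTorch sketch of how such modules could look. The class names, the mean-pooled query vector, the sigmoid relevance gate, and the linear position-mixing layer are all illustrative assumptions, not the paper's actual formulation; the only properties taken from the summary are that one module is conditioned on the query, the other is not, and both act on ViT tokens outside the attention blocks.

```python
import torch
import torch.nn as nn

class QueryAwareModule(nn.Module):
    """Illustrative sketch: pools the query text embedding into a single
    query vector and gates visual tokens by estimated relevance."""
    def __init__(self, query_dim: int, vis_dim: int):
        super().__init__()
        self.proj = nn.Linear(query_dim, vis_dim)

    def forward(self, vis_tokens: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, D); query_emb: (B, L, Dq)
        q = self.proj(query_emb.mean(dim=1))                # (B, D): the "unique query vector"
        gate = torch.sigmoid(vis_tokens @ q.unsqueeze(-1))  # (B, N, 1): per-token relevance
        return vis_tokens * gate                            # query-guided visual tokens

class QueryAgnosticModule(nn.Module):
    """Illustrative sketch: models token-level spatial relationships with a
    learned mixing layer over token positions, independent of any query."""
    def __init__(self, num_tokens: int):
        super().__init__()
        self.pos_mix = nn.Linear(num_tokens, num_tokens)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # (B, N, D) -> (B, D, N), mix across positions, back to (B, N, D), residual add
        return vis_tokens + self.pos_mix(vis_tokens.transpose(1, 2)).transpose(1, 2)

# Smoke test with toy shapes.
tokens = torch.randn(2, 197, 768)   # ViT tokens (B, N, D)
query = torch.randn(2, 8, 512)      # query embedding (B, L, Dq)
qa = QueryAwareModule(query_dim=512, vis_dim=768)
qg = QueryAgnosticModule(num_tokens=197)
out = qa(qg(tokens), query)         # (2, 197, 768)
```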

📝 Abstract
In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model's focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.
Problem

Research questions and friction points this paper is trying to address.

Optimize vision encoder for query-specific regions in documents
Adapt query injection to data-scarce fine-tuning scenarios
Enhance OCR-free visual document understanding performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates query embeddings into vision encoder
Dual-module framework for query guidance
Operates independently of vision attention blocks (see the sketch below)
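Because both modules sit outside the attention blocks, the architecture-preserving claim can plausibly be realized by attaching them to a stock backbone without editing its layers. The sketch below, reusing the two modules from the earlier example, does this with a PyTorch forward hook on a timm ViT; the hook location (the last block) and the placeholder query embedding are assumptions for illustration, not the paper's actual integration point.

```python
import timm
import torch

# A stock ViT backbone; its attention blocks are left completely untouched.
vit = timm.create_model("vit_base_patch16_224", pretrained=False)

# Reuse the sketch modules above (197 = 196 patch tokens + 1 CLS token).
query_aware = QueryAwareModule(query_dim=512, vis_dim=vit.embed_dim)
query_agnostic = QueryAgnosticModule(num_tokens=197)

current_query_emb = torch.randn(1, 8, 512)  # placeholder per-batch query embedding

def inject(module, inputs, output):
    # Returning a tensor from a forward hook replaces the block's output,
    # so the injection happens after (outside) the block's own attention.
    return query_aware(query_agnostic(output), current_query_emb)

vit.blocks[-1].register_forward_hook(inject)

feats = vit.forward_features(torch.randn(1, 3, 224, 224))  # (1, 197, 768)
```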
👥 Authors
Binh M. Le, Sungkyunkwan University, S. Korea
Shaoyuan Xu, Senior Applied Scientist, Amazon Inc. (NLP, LLMs, Multi-modality Learning, Machine Learning, Computer Vision)
Jinmiao Fu, Amazon, USA
Zhishen Huang, Amazon, USA
Moyan Li, Amazon, USA
Yanhui Guo, Amazon, USA
Hongdong Li, Amazon, USA
Sameera Ramasinghe, Founding Scientist, Pluralis Research (Computer Vision, Machine Learning, LLMs)
Bryan Wang, Adobe Research (Human-AI Interaction, Interactive AI, Video, Audio)