🤖 AI Summary
This paper introduces the first end-to-end speech-to-visual document question answering framework that bypasses text-based intermediaries (e.g., ASR, TTS, OCR), directly addressing the core challenge of retrieving and reasoning over textual content embedded in document images using spoken queries. Methodologically, it constructs a cross-modal speech–document joint embedding space, incorporates a layout-aware re-ranking mechanism to enhance retrieval precision, and designs a lightweight QA module for direct speech-to-visual knowledge alignment and inference. Key contributions include: (1) the first fully text-free speech–document retrieval paradigm; (2) the first bilingual (Chinese–English) speech–document RAG benchmark dataset; and (3) consistent and significant improvements over conventional multi-stage pipelines across multi-scale experiments—achieving +12.6% absolute gain in retrieval accuracy and +9.3% in QA F1 score.
📝 Abstract
Document images encapsulate a wealth of knowledge, and the portability of spoken queries enables broader, more flexible application scenarios. Yet no prior work has explored knowledge-base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism that refines retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag
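The retrieve-then-rerank stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding model, the fusion weight `alpha`, and the `layout_rerank` scoring are all assumptions; in the actual framework the query embedding comes from a speech encoder and the page embeddings from a visual document encoder in a joint space.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, page_embs, top_k=5):
    """First-stage retrieval: cosine similarity between the speech-query
    embedding and document-page embeddings in the joint space."""
    sims = l2_normalize(page_embs) @ l2_normalize(query_emb)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

def layout_rerank(candidates, sims, layout_scores, alpha=0.7):
    """Hypothetical layout-aware reranking: fuse embedding similarity with a
    per-page layout score via a weighted sum (alpha is an assumed weight)."""
    fused = alpha * sims + (1 - alpha) * layout_scores[candidates]
    return candidates[np.argsort(-fused)]

# Stand-in data: random embeddings in place of real speech/document encoders.
rng = np.random.default_rng(0)
query = rng.normal(size=128)          # speech-query embedding (placeholder)
pages = rng.normal(size=(100, 128))   # document-page embeddings (placeholder)
layout = rng.uniform(size=100)        # layout-aware scores (placeholder)

cands, sims = retrieve(query, pages, top_k=5)
reranked = layout_rerank(cands, sims, layout)
```

The reranker only reorders the first-stage candidates; the top reranked page would then be passed, together with the speech query, to the textless QA module for answer generation.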