🤖 AI Summary
This paper introduces the first end-to-end speech-to-visual document question answering framework that bypasses text-based intermediaries (e.g., ASR, TTS, OCR), directly addressing the core challenge of retrieving and reasoning over textual content embedded in document images using spoken queries. Methodologically, it constructs a cross-modal speech–document joint embedding space, incorporates a layout-aware re-ranking mechanism to enhance retrieval precision, and designs a lightweight QA module for direct speech-to-visual knowledge alignment and inference. Key contributions include: (1) the first fully text-free speech–document retrieval paradigm; (2) the first bilingual (Chinese–English) speech–document RAG benchmark dataset; and (3) consistent and significant improvements over conventional multi-stage pipelines across multi-scale experiments—achieving +12.6% absolute gain in retrieval accuracy and +9.3% in QA F1 score.
📝 Abstract
Document images encapsulate a wealth of knowledge, and the portability of spoken queries enables broader, more flexible application scenarios. Yet no prior work has explored knowledge-base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism that refines retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag
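The retrieve-then-rerank stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding model, the fusion weight `alpha`, and the `layout_rerank` scoring are all assumptions; in the actual framework the query embedding comes from a speech encoder and the page embeddings from a visual document encoder in a joint space.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, page_embs, top_k=5):
    """First-stage retrieval: cosine similarity between the speech-query
    embedding and document-page embeddings in the joint space."""
    sims = l2_normalize(page_embs) @ l2_normalize(query_emb)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

def layout_rerank(candidates, sims, layout_scores, alpha=0.7):
    """Hypothetical layout-aware reranking: fuse embedding similarity with a
    per-page layout score via a weighted sum (alpha is an assumed weight)."""
    fused = alpha * sims + (1 - alpha) * layout_scores[candidates]
    return candidates[np.argsort(-fused)]

# Stand-in data: random embeddings in place of real speech/document encoders.
rng = np.random.default_rng(0)
query = rng.normal(size=128)          # speech-query embedding (placeholder)
pages = rng.normal(size=(100, 128))   # document-page embeddings (placeholder)
layout = rng.uniform(size=100)        # layout-aware scores (placeholder)

cands, sims = retrieve(query, pages, top_k=5)
reranked = layout_rerank(cands, sims, layout)
```

The reranker only reorders the first-stage candidates; the top reranked page would then be passed, together with the speech query, to the textless QA module for answer generation.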