MuDoC: An Interactive Multimodal Document-grounded Conversational AI System

📅 2025-02-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the fragmentation of multimodal information and the unverifiability of responses in long-document interaction, this paper introduces MuDoC, an interactive multimodal document-grounded dialogue system that supports joint retrieval and generation over both text and figures. Methodologically, it models native in-document figures and text together during response generation, implementing a GPT-4o-based multimodal architecture that integrates document-level text-figure alignment, cross-modal attention mechanisms, and source-localization navigation. Key contributions include: (1) traceable response generation with interleaved text and figures and explicit provenance; (2) an intelligent textbook interface enabling one-click navigation to the original passages and corresponding figures, improving verifiability and interaction interpretability; and (3) qualitative validation on real-world textbook data, demonstrating coordinated multimodal response generation and highlighting the system's strengths and limitations.

📝 Abstract
Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent 'MuDoC' based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.
Problem

Research questions and friction points this paper is trying to address.

Grounded interaction with long multimodal documents
Jointly leveraging in-document visuals and text for response generation
Building an interactive, verifiable conversational AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved text-and-figure response generation grounded in documents
Interactive conversational agent (MuDoC) based on GPT-4o
Instant navigation to source text and figures for verification
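The grounded retrieve-and-interleave loop described above can be sketched as follows. This is a hypothetical minimal illustration, not the authors' implementation: MuDoC uses GPT-4o and learned representations, whereas the chunk structure, the toy bag-of-words similarity, and all function names here are assumptions for demonstration. The key idea retained is that text passages and figures (via their captions) are ranked jointly, and every snippet in the answer carries a provenance tag pointing back to its source page.

```python
# Hypothetical sketch of MuDoC-style grounded retrieval; all names and
# the similarity function are illustrative stand-ins, not the paper's method.
from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class Chunk:
    kind: str     # "text" or "figure" (figures are matched via their captions)
    content: str  # passage text or figure caption
    page: int     # source page, enabling one-click navigation to the original


def _bow(s: str) -> Counter:
    return Counter(s.lower().split())


def similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (stand-in for learned embeddings)."""
    va, vb = _bow(a), _bow(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank text and figure chunks jointly and return the top-k."""
    return sorted(chunks, key=lambda c: similarity(query, c.content), reverse=True)[:k]


def answer(query: str, chunks: list[Chunk]) -> str:
    """Interleave retrieved text and figures, each with a provenance tag."""
    parts = []
    for c in retrieve(query, chunks):
        parts.append(f"{c.content} [{c.kind}, p.{c.page}]")
    return "\n".join(parts)


# Toy "document": two text passages and one figure caption.
doc = [
    Chunk("text", "A neural network is a layered function approximator.", 12),
    Chunk("figure", "Figure 3: architecture of a feedforward neural network.", 13),
    Chunk("text", "Gradient descent minimizes the training loss.", 20),
]
print(answer("How does a neural network work?", doc))
```

Because figures compete in the same ranking as text, a sufficiently relevant caption surfaces a figure directly into the response, and the `[kind, p.N]` tags are what an interface like MuDoC's intelligent textbook could turn into click-to-navigate links.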
Karan Taneja
Georgia Institute of Technology
Multi-modal Agents · Natural Language Processing · AI in Education · Human-Computer Interaction
Ashok K. Goel
School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332 USA