Advancing Conversational Diagnostic AI with Multimodal Reasoning

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current conversational diagnostic systems rely predominantly on text-only interaction and cannot analyze in real time the multimodal clinical data, such as medical images, ECGs, and PDF reports, that is essential for telemedicine. To address this, we propose a state-driven multimodal conversational diagnostic framework built on Gemini 2.0 Flash, featuring a dynamic state-aware mechanism that jointly enables multimodal understanding, uncertainty modeling, and structured clinical questioning. Crucially, we introduce a method for autonomously generating follow-up questions from patient-state uncertainty, emulating expert clinicians' diagnostic reasoning. Evaluated on 105 OSCE scenarios, our system outperformed primary care physicians on 7 of 9 multimodal and 29 of 32 non-multimodal clinical axes, including diagnostic accuracy, indicating that multimodal capability and diagnostic performance can be advanced together.
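The summary hinges on Gemini 2.0 Flash natively accepting images and documents alongside text. As a rough illustration of only the multimodal input path (not AMIE's actual pipeline or prompts), a patient-uploaded skin photo could be passed into a consultation turn through the public google-generativeai SDK; the prompt wording, file name, and model string below are assumptions.

```python
import google.generativeai as genai
from PIL import Image

# Illustration only: feeding a patient-uploaded artifact (a skin photo) to
# Gemini 2.0 Flash together with dialogue context. AMIE's real prompting and
# state handling are not public; this shows just the multimodal input path.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

photo = Image.open("skin_lesion.jpg")  # hypothetical upload from the chat
response = model.generate_content([
    "You are assisting a diagnostic conversation. Describe clinically "
    "relevant findings in the attached photo and list the follow-up "
    "questions they suggest.",
    photo,
])
print(response.text)
```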

📝 Abstract
Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
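The abstract describes the control flow but no implementation. A minimal sketch of the state-aware framework it outlines, in which intermediate model outputs maintain a patient state that steers the conversation from history-taking toward diagnosis and management, might look like the following; every name and the phase-transition threshold are assumptions, not details from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS = auto()
    MANAGEMENT = auto()

@dataclass
class PatientState:
    # Findings gathered from text and interpreted artifacts (photos, ECGs, PDFs).
    findings: list = field(default_factory=list)
    # Evolving differential diagnosis: candidate condition -> estimated probability.
    differential: dict = field(default_factory=dict)
    phase: Phase = Phase.HISTORY_TAKING

def advance(state: PatientState, turn_output) -> PatientState:
    """Fold one turn's intermediate model outputs into the patient state, then
    let the updated state, rather than a fixed script, pick the next phase."""
    state.findings.extend(turn_output.new_findings)
    state.differential = turn_output.differential
    confidence = max(state.differential.values(), default=0.0)
    if state.phase is Phase.HISTORY_TAKING and confidence >= 0.7:  # threshold assumed
        state.phase = Phase.DIAGNOSIS   # confident enough to discuss a leading diagnosis
    elif state.phase is Phase.DIAGNOSIS:
        state.phase = Phase.MANAGEMENT  # proceed to management reasoning
    return state
```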
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for multimodal medical diagnosis beyond text-only interactions
Enhancing AI's ability to interpret diverse medical data during consultations
Comparing AI diagnostic performance with physicians in structured clinical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal data interpretation during consultations
State-aware dialogue framework for dynamic control
Uncertainty-directed follow-up questions for structured history-taking (see the sketch after this list)
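The uncertainty-directed questioning in the last bullet can be read as an information-gain policy over the differential diagnosis. A sketch under that reading, assuming a Shannon-entropy uncertainty measure and a hypothetical answer_model that predicts how each plausible patient answer would reshape the differential:

```python
import math

def entropy(differential):
    """Shannon entropy (bits) of the differential-diagnosis distribution."""
    return -sum(p * math.log2(p) for p in differential.values() if p > 0)

def expected_information_gain(differential, question, answer_model):
    """Expected entropy reduction from asking `question`. `answer_model` is a
    hypothetical component yielding (answer_probability, posterior_differential)
    pairs over the plausible patient answers."""
    prior = entropy(differential)
    expected_posterior = sum(
        p_answer * entropy(posterior)
        for p_answer, posterior in answer_model(question)
    )
    return prior - expected_posterior

def next_question(differential, candidate_questions, answer_model):
    # Ask whichever follow-up question most sharpens the differential,
    # emulating how experienced clinicians prioritize discriminating questions.
    return max(
        candidate_questions,
        key=lambda q: expected_information_gain(differential, q, answer_model),
    )
```

Whether AMIE scores candidate questions this explicitly is not stated; the paper says only that follow-up questions are strategically directed by uncertainty in the patient state.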
👥 Authors

Khaled Saab
Google DeepMind
Jan Freyberg
Google Research
Chunjong Park
Google DeepMind
Tim Strother
Google DeepMind
Deep Learning, Machine Learning
Yong Cheng
Google DeepMind
Wei-Hung Weng
Google DeepMind
artificial intelligence, machine learning, natural language processing, medical imaging, healthcare
David G. T. Barrett
Google DeepMind
David Stutz
Research Scientist, DeepMind
deep learning, ai agents, ai for science, uncertainty estimation, computer vision
Nenad Tomašev
Google DeepMind
Anil Palepu
PhD Student, Harvard-MIT Health Science & Technology
Valentin Liévin
Google DeepMind
machine learning, healthcare
Yash Sharma
Google Research
Roma Ruparel
Unknown affiliation
Abdullah Ahmed
University of Massachusetts Amherst
Elahe Vedadi
Google DeepMind
AI, Distributed Computing, Information Theory, Secure & Private Computing
K. Kanada
Google Research
Cían Hughes
Google Research
Yun Liu
Google Research
Geoff Brown
Google DeepMind
Yang Gao
Google DeepMind
Sean Li
Google DeepMind
S. Mahdavi
Google DeepMind
James Manyika
Google Research
Katherine Chou
Google
ML, Health, Graphics
Yossi Matias
Google
A. Hassidim
Google DeepMind
Dale R. Webster
Google Research
Pushmeet Kohli
DeepMind
AI for Science, Machine Learning, AI Safety, Computer Vision, Program Synthesis
S. M. A. Eslami
Google DeepMind
Joelle Barral
Google DeepMind
Adam Rodman
Assistant Professor of Medicine, Harvard Medical School
Clinical reasoning, AI, digital education, medical history
Vivek Natarajan
Google Research
M. Schaekermann
Google Research
Tao Tu
Columbia University, Google
multi-modal neuroimaging, machine learning, neural information processing
A. Karthikesalingam
Google Research
Ryutaro Tanno
Research Scientist, Google DeepMind
Machine Learning, Deep Learning, Healthcare, Computer Vision