AutoMedPrompt: A New Framework for Optimizing LLM Medical Prompts Using Textual Gradients

📅 2025-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Manual prompt engineering for large language models (LLMs) in medical question answering suffers from poor generalizability, high labor cost, and susceptibility to noise. Method: We propose the first TextGrad-driven automated system prompt optimization framework tailored for medicine. It requires no model fine-tuning and performs system-level prompt search through a multi-round, feedback-guided iterative mechanism that treats textual feedback as a gradient signal. Contribution/Results: Our approach demonstrates cross-domain prompt generalizability, validated on specialized benchmarks including NephSAP (nephrology). Experiments show state-of-the-art accuracy of 82.6% on PubMedQA, 77.7% on MedQA, and 63.8% on NephSAP, consistently outperforming GPT-4, Claude 3 Opus, and Med-PaLM 2. This establishes a scalable, efficient paradigm for domain-specific prompt engineering in clinical AI applications.

📝 Abstract
Large language models (LLMs) have demonstrated increasingly sophisticated performance in medical and other fields of knowledge. Traditional methods of creating specialist LLMs require extensive fine-tuning and training of models on large datasets. Recently, prompt engineering, instead of fine-tuning, has shown potential to boost the performance of general foundation models. However, prompting methods such as chain-of-thought (CoT) may not be suitable for all subspecialties, and k-shot approaches may introduce irrelevant tokens into the context space. We present AutoMedPrompt, which explores the use of textual gradients to elicit medically relevant reasoning through system prompt optimization. AutoMedPrompt leverages TextGrad's automatic differentiation via text to improve the ability of general foundation LLMs. We evaluated AutoMedPrompt on Llama 3, an open-source LLM, using several QA benchmarks, including MedQA, PubMedQA, and the nephrology subspecialty-specific NephSAP. Our results show that prompting with textual gradients outperforms previous methods on open-source LLMs and surpasses proprietary models such as GPT-4, Claude 3 Opus, and Med-PaLM 2. AutoMedPrompt sets a new state-of-the-art (SOTA) performance on PubMedQA with an accuracy of 82.6%, while also outperforming previous prompting strategies on open-source models for MedQA (77.7%) and NephSAP (63.8%).
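The optimization loop described in the abstract (evaluate, critique, rewrite the system prompt) can be sketched as follows. This is a minimal illustration of a TextGrad-style textual-gradient loop, not the paper's implementation: `call_llm` is a hypothetical stand-in for a real model API (e.g. Llama 3), stubbed here so the example runs without network access, and the correctness check is deliberately crude.

```python
# Minimal sketch of textual-gradient system prompt optimization.
# call_llm is a hypothetical stub standing in for a real LLM API call.

def call_llm(prompt: str) -> str:
    # Stubbed LLM: a real system would query e.g. Llama 3 here.
    if "critique" in prompt.lower():
        return "Add an instruction to reason as a nephrology specialist."
    return ("You are a board-certified nephrologist. Reason step by step "
            "and cite clinical guidelines before answering.")

def textual_gradient(system_prompt, question, model_answer, gold_answer):
    """Ask a critic LLM how the prompt should change (the 'gradient')."""
    return call_llm(
        f"Critique this system prompt given the model's error.\n"
        f"Prompt: {system_prompt}\nQuestion: {question}\n"
        f"Model answer: {model_answer}\nCorrect answer: {gold_answer}"
    )

def apply_gradient(system_prompt, feedback):
    """'Optimizer step': rewrite the prompt according to the feedback."""
    return call_llm(
        f"Rewrite the following prompt to address the feedback.\n"
        f"Prompt: {system_prompt}\nFeedback: {feedback}"
    )

def optimize(system_prompt, train_examples, rounds=3):
    """Multi-round feedback-guided loop over a small validation set."""
    for _ in range(rounds):
        for question, gold in train_examples:
            answer = call_llm(f"{system_prompt}\n{question}")
            if gold not in answer:  # crude correctness check
                fb = textual_gradient(system_prompt, question, answer, gold)
                system_prompt = apply_gradient(system_prompt, fb)
    return system_prompt

prompt = optimize(
    "You are a helpful medical assistant.",
    [("What is the eGFR threshold for CKD stage 3?", "60")],
)
print(prompt)
```

In the actual TextGrad framework the critique and rewrite steps are performed by a separate "backward engine" LLM, and the loop stops once validation accuracy plateaus; the stub above only illustrates the control flow.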
Problem

Research questions and friction points this paper is trying to address.

Optimizing medical prompts using textual gradients
Enhancing LLMs for medical reasoning without fine-tuning
Improving QA accuracy in medical subspecialties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Textual gradients optimize medical prompts
AutoMedPrompt enhances general LLM performance
Outperforms proprietary models on QA benchmarks
Sean Wu
ETH Zurich
Computer Vision · Computer Graphics · 3D Vision · Robotics · Autonomous Driving

Michael Koo
Biological Sciences Division, University of Chicago

Fabien Scalzo
Associate Professor, Computer Science/Neurology, University of California Los Angeles (UCLA)
Biomedical Engineering · Machine Learning · Computer Vision · Artificial Intelligence · Medicine

Ira Kurtz
Department of Medicine, University of California, Los Angeles; Brain Research Institute, University of California, Los Angeles