🤖 AI Summary
This study addresses clinical question answering over electronic health records (EHRs) in low-supervision settings, aiming to bridge critical information gaps between clinicians and patients. A key challenge lies in jointly optimizing evidence retrieval and faithful answer generation.
Method: We propose a two-stage framework: (1) sentence-level evidence identification, followed by (2) citation-aware answer synthesis to ensure traceability. Methodologically, we introduce agent-based prompt optimization with DSPy's MIPROv2, jointly tuning instructions and few-shot demonstrations; we further incorporate self-consistency voting to enhance evidence recall without compromising precision.
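To make the pipeline concrete, below is a minimal DSPy sketch of the two-stage program and its MIPROv2 compilation. The signature fields, the `evidence_f1` metric, the example data, and all names are illustrative assumptions; the paper's actual prompts, scoring metric, and optimizer budget are not reproduced here.

```python
import dspy

# Assumed model; any dspy.LM-compatible backend works.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SelectEvidence(dspy.Signature):
    """Identify the numbered note sentences essential to answering the question."""
    question: str = dspy.InputField()
    note_sentences: str = dspy.InputField(desc="clinical note, one numbered sentence per line")
    evidence_ids: str = dspy.OutputField(desc="comma-separated sentence numbers")

class SynthesizeAnswer(dspy.Signature):
    """Answer the question using only the given evidence, citing sentence ids."""
    question: str = dspy.InputField()
    evidence: str = dspy.InputField(desc="selected evidence sentences")
    answer: str = dspy.OutputField(desc="answer text with |sentence_id| citations")

class TwoStageQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.select = dspy.ChainOfThought(SelectEvidence)
        self.synthesize = dspy.ChainOfThought(SynthesizeAnswer)

    def forward(self, question, note_sentences):
        ev = self.select(question=question, note_sentences=note_sentences)
        ans = self.synthesize(question=question, evidence=ev.evidence_ids)
        return dspy.Prediction(evidence_ids=ev.evidence_ids, answer=ans.answer)

def evidence_f1(example, pred, trace=None):
    # Hypothetical metric: F1 over gold vs. predicted evidence sentence ids.
    gold = {s.strip() for s in example.evidence_ids.split(",") if s.strip()}
    guess = {s.strip() for s in pred.evidence_ids.split(",") if s.strip()}
    if not gold or not guess:
        return 0.0
    p, r = len(gold & guess) / len(guess), len(gold & guess) / len(gold)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Development examples drive the search over instructions and demonstrations.
devset = [
    dspy.Example(question="Why was my husband given Lasix?",
                 note_sentences="1. Admitted with acute CHF exacerbation. ...",
                 evidence_ids="1").with_inputs("question", "note_sentences"),
]

optimizer = dspy.MIPROv2(metric=evidence_f1, auto="light")
compiled = optimizer.compile(TwoStageQA(), trainset=devset,
                             max_bootstrapped_demos=4, max_labeled_demos=4)
```

Compilation searches candidate instructions and bootstrapped demonstrations for both stages jointly, which is what lets a small development set stand in for model fine-tuning.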
Contribution/Results: On the hidden test set of ArchEHR-QA, our approach achieves an overall score of 51.5, ranking second among all submissions, and outperforms zero-shot and few-shot baselines by over 20 and 10 points, respectively. These results demonstrate the framework's effectiveness and robustness in low-resource clinical QA scenarios.
📝 Abstract
Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy's MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second overall while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
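One plausible reading of the self-consistency step is to sample the evidence selector several times at a nonzero temperature and keep the sentence ids that enough runs agree on. The sketch below assumes a hypothetical `select_fn` callable plus illustrative `n_samples` and `min_votes` values; none of these are taken from the paper.

```python
from collections import Counter

def self_consistent_evidence(select_fn, question, note_sentences,
                             n_samples=5, min_votes=2):
    """Majority-vote evidence ids across repeated sampled runs.

    select_fn: any callable (question, note_sentences) -> "1, 4, 7"-style
    id string, e.g. a compiled predictor sampled with temperature > 0.
    """
    votes = Counter()
    for _ in range(n_samples):
        ids = {s.strip() for s in select_fn(question, note_sentences).split(",")
               if s.strip()}
        votes.update(ids)
    # A low threshold admits ids seen in only some runs (higher recall),
    # while still filtering one-off selections (protecting precision).
    return sorted(sid for sid, count in votes.items() if count >= min_votes)
```

Under this scheme, raising `n_samples` while holding `min_votes` fixed trades toward recall, consistent with the abstract's claim that voting improves evidence recall without sacrificing precision.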