AI Summary
This work models clinical decision-making as an evolutionary search over executable programs, reducing the cost of adapting large language models (LLMs) to clinical workflows without expensive fine-tuning or manual prompt engineering. The authors study LLM-guided MAP-Elites evolution as an inference-time method in which candidate artifacts are scored by task-specific fitness functions. The approach is evaluated in three clinical scenarios: emergency triage (Semigran accuracy rises from 77.3% to 87.1% and emergency recall from 0.60 to 0.97), interactive patient interviews (evolved policies improve the accuracy-cost frontier and transfer to held-out iCRAFTMD), and PneumoniaMNIST image classification (prompt-only evolution improves frozen MedGemma vision-language models while preserving strict JSON outputs). The gains are attributed to interpretable, program-level mechanisms rather than superficial prompt rewording.
Abstract
Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions.
Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from $77.3\%$ to $87.1\%$ and emergency recall from $0.60$ to $0.97$, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy--cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.
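To make the search loop concrete, the following is a minimal MAP-Elites sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the candidate artifact is reduced to a single triage threshold, the cases are synthetic, and `propose_variant` is a hypothetical stand-in (a random perturbation) for the step where an LLM would rewrite the parent program. The names `fitness`, `descriptor`, `propose_variant`, and `map_elites` are invented for this example.

```python
import random

# Synthetic toy cases: (severity score in [0, 1], is_emergency).
# Illustration only; not data from the paper.
CASES = [
    (0.05, False), (0.20, False), (0.35, False),
    (0.50, True), (0.65, True), (0.80, True), (0.95, True),
]


def fitness(candidate):
    """Task-specific fitness: accuracy of a threshold rule on the toy cases."""
    thr = candidate["threshold"]
    correct = sum((score >= thr) == label for score, label in CASES)
    return correct / len(CASES)


def descriptor(candidate, bins=5):
    """Behaviour descriptor: binned fraction of cases flagged as emergencies."""
    flagged = sum(score >= candidate["threshold"] for score, _ in CASES)
    return min(int(flagged / len(CASES) * bins), bins - 1)


def propose_variant(parent, rng):
    """Hypothetical stand-in for the LLM-guided mutation step.

    A real system would ask an LLM to rewrite the parent artifact;
    here the threshold is simply perturbed with Gaussian noise.
    """
    thr = parent["threshold"] + rng.gauss(0.0, 0.15)
    return {"threshold": min(1.0, max(0.0, thr))}


def map_elites(iterations=200, seed=0):
    """Keep the best candidate (elite) found so far in each descriptor cell."""
    rng = random.Random(seed)
    start = {"threshold": 0.9}  # deliberately weak hand-written baseline
    archive = {descriptor(start): (fitness(start), start)}
    for _ in range(iterations):
        # Select a parent uniformly among the current elites.
        _, parent = rng.choice(list(archive.values()))
        child = propose_variant(parent, rng)
        cell, score = descriptor(child), fitness(child)
        # Insert the child if its cell is empty or it beats the incumbent.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive
```

Calling `max(map_elites().values(), key=lambda entry: entry[0])` returns the highest-fitness elite. The archive also retains lower-scoring but behaviourally different candidates, and that diversity is what distinguishes MAP-Elites from a plain hill climber.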