AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations of medical AI agents focus mainly on final outputs and offer little visibility into how an agent behaves during the research process. AutoMedBench addresses this gap with a workflow-aware benchmark for autonomous medical-AI research that organizes agent execution into five stages (S1-S5): Plan, Setup, Validate, Inference, and Submit. It spans five research tracks (segmentation, image enhancement, visual question answering, report generation, and lesion detection) and two difficulty tiers, Lite and Standard. Each run is scored on both final task performance and per-stage S1-S5 scores, which supports stage-level diagnosis of long-horizon runs averaging 33 agent turns. Across thousands of recorded runs, Validate is the weakest stage on average and Setup the strongest. Verification and submission failures account for 37.7% and 38.1% of fired error codes, and runs with one fired error code score 48% lower overall, on average, than runs with none.

📝 Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.
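The stage-level analysis described in the abstract can be illustrated with a minimal sketch. The record layout, field names, error-code labels, and all numbers below are hypothetical and chosen only for illustration; they are not AutoMedBench's schema, scoring formula, or results. The sketch averages per-stage scores to find the weakest and strongest stage, computes each error code's share of all fired codes, and compares the mean overall score of runs with exactly one fired error code against runs with none.

```python
from collections import Counter
from statistics import mean

# Stage keys mirror the S1-S5 workflow named in the abstract; the key
# strings themselves are an assumption made for this sketch.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")

# Made-up toy records, NOT AutoMedBench data.
RUNS = [
    {"stages": dict(zip(STAGES, (0.8, 0.9, 0.4, 0.7, 0.8))),
     "overall": 0.8, "error_codes": []},
    {"stages": dict(zip(STAGES, (0.6, 1.0, 0.2, 0.5, 0.4))),
     "overall": 0.4, "error_codes": ["submission"]},
    {"stages": dict(zip(STAGES, (0.7, 0.8, 0.3, 0.6, 0.6))),
     "overall": 0.6, "error_codes": []},
]


def stage_means(runs):
    """Average each stage score over all runs."""
    return {s: mean(r["stages"][s] for r in runs) for s in STAGES}


def error_code_shares(runs):
    """Fraction of all fired error codes that each code accounts for."""
    counts = Counter(code for r in runs for code in r["error_codes"])
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def relative_score_gap(runs):
    """Relative drop in mean overall score: one fired code vs. none."""
    clean = [r["overall"] for r in runs if len(r["error_codes"]) == 0]
    one = [r["overall"] for r in runs if len(r["error_codes"]) == 1]
    return 1 - mean(one) / mean(clean)


means = stage_means(RUNS)
weakest = min(means, key=means.get)      # stage with the lowest mean score
strongest = max(means, key=means.get)    # stage with the highest mean score
shares = error_code_shares(RUNS)
gap = relative_score_gap(RUNS)
print(weakest, strongest, shares, round(gap, 3))
```

In this toy data the weakest and strongest stages happen to match the pattern the abstract reports (Validate and Setup), but only because the numbers were picked that way; the sketch shows how such a comparison could be computed, not how the benchmark computes it.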

Problem

Research questions and friction points this paper is trying to address.

medical AI agents

benchmarking

workflow evaluation

autonomous research

stage-level analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

workflow-aware benchmark

agentic AI

medical auto-research