MedCTA: A Benchmark for Clinical Tool Agents

📅 2026-06-10

🤖 AI Summary

Current medical AI benchmarks do not adequately assess how reliably models invoke tools and reason during multi-step clinical decision-making. This work proposes MedCTA, a benchmark grounded in real-world multimodal clinical data (radiology images, pathology slides, and clinical reports) that pairs clinician-validated, step-implicit tasks with executable tool trajectories. It comprises 107 tasks over 5 deployed tools and supports end-to-end evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. An evaluation of 18 open- and closed-source multimodal models shows that even frontier systems remain brittle in autonomous tool use. The dataset and evaluation suite are publicly released to advance the development of trustworthy medical AI.

📝 Abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/
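To make the idea of process-aware evaluation concrete, the following is a minimal Python sketch of how a predicted tool trajectory might be scored against a clinician-verified one. It is not the MedCTA evaluation suite: the `ToolCall` type, the `score_trajectory` function, the tool names, and the four metric definitions are all hypothetical, chosen only to illustrate the kinds of quantities the abstract lists (tool selection, argument validity, execution stability, trajectory fidelity).

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of a trajectory (hypothetical schema, not MedCTA's)."""
    tool: str
    args: dict = field(default_factory=dict)
    ok: bool = True  # whether the call executed without error


def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def score_trajectory(pred, gold):
    """Return illustrative process-level scores in [0, 1] (assumed definitions)."""
    n_gold = len(gold)
    # Tool selection: share of gold steps whose tool matches at the same position.
    matched = [(p, g) for p, g in zip(pred, gold) if p.tool == g.tool]
    tool_selection = len(matched) / n_gold if n_gold else 0.0
    # Argument validity: among position-matched steps, share with identical arguments.
    arg_validity = sum(p.args == g.args for p, g in matched) / len(matched) if matched else 0.0
    # Execution stability: share of predicted calls that ran without error.
    exec_stability = sum(c.ok for c in pred) / len(pred) if pred else 0.0
    # Trajectory fidelity: order-preserving overlap of tool names with the gold sequence.
    fidelity = lcs_len([c.tool for c in pred], [c.tool for c in gold]) / n_gold if n_gold else 0.0
    return {
        "tool_selection": tool_selection,
        "argument_validity": arg_validity,
        "execution_stability": exec_stability,
        "trajectory_fidelity": fidelity,
    }


# Hypothetical example: the agent skips a step (premature stopping of the
# middle stage) and its final call fails at execution time.
gold = [
    ToolCall("segment", {"organ": "lung"}),
    ToolCall("measure", {"target": "nodule"}),
    ToolCall("report"),
]
pred = [
    ToolCall("segment", {"organ": "lung"}),
    ToolCall("report", ok=False),
]
scores = score_trajectory(pred, gold)
print(scores)
```

Separating positional tool matching from order-preserving overlap is one way to tell a skipped step apart from a wrong tool choice; a real evaluation would also need an outcome-quality score, which this sketch leaves out.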

Problem

Research questions and friction points this paper is trying to address.

medical AI agents

clinical tool use

benchmark

multimodal clinical tasks

agent reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

clinical tool agents

multimodal medical benchmark

process-aware evaluation