ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing medical benchmarks fail to capture the real clinical process, in which physicians iteratively gather heterogeneous information under uncertainty and make irreversible sequential decisions. To address this gap, this work proposes ClinEnv, a multi-stage interactive environment grounded in real inpatient records, built on a paradigm the authors call Longitudinal Inpatient Simulation. At each stage, the model under evaluation must actively query four specialized agents before committing to diagnoses, medications, and procedures, and its decision quality and information-seeking behavior are evaluated separately. This approach enables, for the first time, a decoupled assessment of these two dimensions, revealing significant deficiencies in large language models in management decisions and late-stage information acquisition. Experiments across seven models show that even the best-performing model achieves only a 0.31 overall decision F1 score, with management-action F1 as low as 0.17, and that models continue to issue redundant queries as cases progress.
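
One simple way to make "redundant queries" concrete is to count repeated requests to the same agent. The sketch below is illustrative only: the summary does not say how ClinEnv measures redundancy, and the function name `redundant_query_rate`, the agent labels, and the normalization rule are all assumptions.

```python
def redundant_query_rate(queries):
    """Fraction of queries that repeat an earlier (agent, request) pair.

    `queries` is a list of (agent, request) string pairs in the order issued.
    Requests are compared after trimming whitespace and lowercasing, which is
    an assumed normalization, not the benchmark's documented rule.
    """
    seen = set()
    repeats = 0
    for agent, request in queries:
        key = (agent, request.strip().lower())
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(queries) if queries else 0.0


# Hypothetical query log; the agent names are placeholders.
log = [
    ("lab", "CBC"),
    ("lab", "cbc "),            # repeat of the first request
    ("imaging", "chest x-ray"),
]
rate = redundant_query_rate(log)
```

Tracking this rate per stage rather than per case would show whether repetition grows as a case progresses, which is the pattern the summary reports.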

📝 Abstract

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe these properties, and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages: models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.
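
To show how a decision F1 of this kind can be computed, the sketch below scores a predicted set of codes against a reference set. It assumes both sides have already been mapped to identifiers in a shared ontology; the abstract does not spell out the matching procedure, so the function name `set_f1`, the empty-set convention, and the example codes are hypothetical.

```python
def set_f1(predicted, gold):
    """Set-level F1 between predicted and reference code collections.

    Duplicates are ignored. If both sets are empty the score is 1.0,
    an assumed convention rather than a documented one.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Placeholder codes, not taken from the benchmark.
score = set_f1(["CODE_A", "CODE_B"], ["CODE_B", "CODE_C"])
```

Here one of two predictions matches one of two reference codes, so precision and recall are both 0.5 and the F1 is 0.5. Computing the score separately for diagnoses, medications, and procedures would expose the kind of gap between diagnosis and management F1 that the abstract reports.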

Problem

Research questions and friction points this paper is trying to address.

interactive medical benchmark

long-horizon decision-making

information acquisition

clinical simulation

EHR environment

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive medical benchmark

longitudinal inpatient simulation

information acquisition