PhysDox: Benchmarking LLMs on Physical Feasibility Auditing of Physiological Sensing Protocols

📅 2026-06-03

🤖 AI Summary

Physiological sensing protocols generated by large language models are frequently infeasible because they violate physical constraints, and no systematic method exists for evaluating this failure. The authors introduce PhysDox, a benchmark that formalizes and quantifies the task of auditing physical feasibility in physiological protocols. They construct a dataset comprising 683 expert-annotated samples and 5,000 expanded instances, evaluated through a two-stage framework of severity detection followed by constraint-level diagnosis. Comprehensive analyses, including multi-model comparisons, error attribution, and attention probing, reveal that even the best-performing model achieves only a 53.0% macro F1 score in the first stage, with markedly degraded end-to-end performance. Notably, implicit constraint violations are missed at twice the rate of explicit hardware violations (ρ = 0.81, p < 0.01), exposing significant scaffold bias and high omission rates in current models.

📝 Abstract

Large language models (LLMs) increasingly assist in experimental design, yet fluent protocols often remain physically infeasible. We introduce PhysDox, a physical feasibility auditing benchmark for biomedical protocols comprising a 683-sample expert-curated Gold set and a 5,000-sample Silver set across six sensing domains. We formulate the task as a two-stage evaluation: severity detection, which classifies protocols as valid, minor, or fatal, followed by constraint-level diagnosis of fatal violations. Evaluating 6 LLMs across 4 inference strategies yields a peak Stage-1 macro-F1 of only 53.0. Moreover, strong oracle diagnosis collapses during end-to-end evaluation due to correlated cascade errors. Error analysis reveals scaffold bias, where models conflate procedural completeness with physical validity. Consequently, implicit constraints exhibit a miss rate twice that of explicit hardware violations, supported by a strong statistical correlation ($\rho = 0.81$, $p < 0.01$). Trace analysis of false negatives exposes a 54%/46% split between attention and judgment failures, ultimately demonstrating that protocol auditing demands calibrated feasibility reasoning rather than factual recall or longer rationales.
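To make the two-stage scoring concrete, the following is a minimal sketch of how Stage-1 macro-F1 and an end-to-end score could be computed. It is not the authors' evaluation code: the function names, the toy labels, the constraint names ("sampling_rate", "motion_artifact"), and the definition of end-to-end accuracy used here are all assumptions made for illustration. Only the three severity classes (valid, minor, fatal) come from the abstract.

```python
# Illustrative sketch only; names and toy data are hypothetical, not from PhysDox.
LABELS = ("valid", "minor", "fatal")  # severity classes named in the abstract


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (Stage 1: severity detection)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def end_to_end_accuracy(gold_sev, pred_sev, gold_con, pred_con):
    """Assumed metric: share of gold-fatal protocols that are both flagged
    fatal in Stage 1 and assigned the correct constraint in Stage 2."""
    fatal_idx = [i for i, g in enumerate(gold_sev) if g == "fatal"]
    if not fatal_idx:
        return 0.0
    hits = sum(
        1 for i in fatal_idx
        if pred_sev[i] == "fatal" and pred_con[i] == gold_con[i]
    )
    return hits / len(fatal_idx)


# Toy example with six protocols (hypothetical data).
gold_sev = ["valid", "minor", "fatal", "fatal", "valid", "minor"]
pred_sev = ["valid", "valid", "fatal", "minor", "valid", "minor"]
gold_con = [None, None, "sampling_rate", "motion_artifact", None, None]
pred_con = [None, None, "sampling_rate", None, None, None]

stage1 = macro_f1(gold_sev, pred_sev)
e2e = end_to_end_accuracy(gold_sev, pred_sev, gold_con, pred_con)
print(f"Stage-1 macro-F1: {stage1:.3f}")
print(f"End-to-end accuracy on fatal cases: {e2e:.3f}")
```

In this toy run, the fatal protocol that Stage 1 mislabels as minor can never be diagnosed correctly in Stage 2, which illustrates how a Stage-1 miss cascades into the end-to-end score even when diagnosis on correctly flagged cases is perfect.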

Problem

Research questions and friction points this paper is trying to address.

physical feasibility

LLM evaluation

biomedical protocols

physiological sensing

constraint violation

Innovation

Methods, ideas, or system contributions that make the work stand out.

physical feasibility auditing

LLM benchmarking

biomedical protocol validation