🤖 AI Summary
This study investigates the capacity of large language models (LLMs) to evaluate clinical trial reports against the CONSORT guidelines, exposing practical limitations in automating medical compliance assessment. Using a behavioral and metacognitive analytical framework and expert-annotated data, we systematically compare zero-shot, few-shot, and chain-of-thought prompting strategies across three dimensions: reasoning trace fidelity, uncertainty articulation, and generation of alternative interpretations. Results reveal substantial heterogeneity in LLM performance across CONSORT items, with pervasive logical leaps, selective evidence omission, and attribution biases, particularly under conditions requiring multi-step causal inference or interpretation of ambiguous phrasing. Notably, this work applies a metacognitive lens to medical text evaluation for the first time, establishing both a methodological foundation for explainable, verifiable clinical AI and empirically grounded boundaries on current LLM capabilities in regulatory assessment contexts.
📝 Abstract
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting against CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached different CONSORT items and prompt types: shifts in reasoning style, explicit expressions of uncertainty, and the generation of alternative interpretations all shaped response patterns. Our results highlight the current limitations of these systems in automating clinical compliance assessment and underscore the importance of understanding their cognitive adaptations and strategic behaviors for developing more explainable and reliable medical AI.
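For readers unfamiliar with the three prompt conditions, the sketch below illustrates how zero-shot, few-shot, and chain-of-thought templates might be constructed for a single CONSORT item. This is a minimal illustration under our own assumptions: the example item, template wording, and the `build_prompt` helper are hypothetical and are not the study's actual prompts or materials.

```python
# Minimal sketch of the three prompting conditions compared in the study.
# All template wording, the example CONSORT item, and function names are
# illustrative assumptions, not the study's actual prompts or data.

CONSORT_ITEM = (
    "Item 8a: Method used to generate the random allocation sequence."
)

FEW_SHOT_EXAMPLE = (
    "Report excerpt: 'Patients were randomized using a computer-generated "
    "sequence.'\n"
    "Assessment: Compliant. The excerpt names the method used to generate "
    "the allocation sequence."
)

def build_prompt(report_text: str, strategy: str) -> str:
    """Assemble an evaluation prompt for one CONSORT item under one strategy."""
    task = (
        f"Evaluate whether the following clinical trial report excerpt "
        f"complies with CONSORT {CONSORT_ITEM}\n\n"
        f"Report excerpt: {report_text}\n"
    )
    if strategy == "zero-shot":
        # No examples, no reasoning instruction: verdict only.
        return task + "Answer: Compliant or Non-compliant."
    if strategy == "few-shot":
        # Prepend one worked example before the task.
        return f"Example:\n{FEW_SHOT_EXAMPLE}\n\n" + task + "Assessment:"
    if strategy == "chain-of-thought":
        # Ask for explicit reasoning, uncertainty, and alternatives,
        # matching the behavioral dimensions the study analyzes.
        return task + (
            "Think step by step: quote the relevant evidence, state any "
            "uncertainty, consider alternative interpretations, then give "
            "a final verdict."
        )
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    excerpt = "Participants were allocated to treatment groups."
    for strategy in ("zero-shot", "few-shot", "chain-of-thought"):
        print(f"--- {strategy} ---")
        print(build_prompt(excerpt, strategy))
```

Under this framing, the model's responses to each template variant would then be scored along the study's three dimensions (reasoning trace fidelity, uncertainty articulation, and alternative interpretations) against expert annotations.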