🤖 AI Summary
This study investigates whether ChatGPT quality scores can substitute for citation-based indicators when estimating the quality of published clinical medicine research, focusing on a previously reported anomaly for this field.
Method: Using the largest dataset yet for this question, all 9,872 articles submitted to the REF 2021 Unit of Assessment (UoA) 1 Clinical Medicine across 31 departments, quality scores were generated with ChatGPT 4o, 4o-mini, and 3.5 turbo and correlated with expert-assessed REF scores and citation rates at the article, departmental, and journal levels.
Contribution/Results: ChatGPT 4o-mini scores correlated positively with departmental mean REF scores at the article level (r = 0.134, against a theoretical maximum of 0.226) and more strongly at the departmental level (r = 0.395, n = 31). For the 100 journals with the most UoA 1 articles, mean ChatGPT scores correlated with mean REF scores (r = 0.495) but negatively with citation rates (r = -0.148). These journal and departmental anomalies suggest ChatGPT is less effective for research in prestigious medical journals or research directly affecting human health, yet the overall positive correlations indicate that it could replace citation-based indicators for new Clinical Medicine research.
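The scoring step can be approximated with the OpenAI chat completions API. Below is a minimal sketch, assuming a simplified REF-style prompt and title-plus-abstract input; the study's exact prompt wording, article format, and score parsing are not given here, so treat every string in the prompt as an illustrative assumption (gpt-4o-mini is the public API identifier for ChatGPT 4o-mini).

```python
# Minimal sketch of scoring one article with GPT-4o-mini on a REF-style 1*-4* scale.
# Assumption: the prompt text and input format below are illustrative only and are
# not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_article(title: str, abstract: str) -> str:
    """Ask the model for a REF-style quality score (1* to 4*) for one article."""
    prompt = (
        "You are an expert REF 2021 assessor for Clinical Medicine. "
        "Rate the originality, significance and rigour of the following article "
        "on the REF scale 1*, 2*, 3* or 4*. Reply with the score only.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation in the returned score
    )
    return response.choices[0].message.content.strip()
```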
📝 Abstract
Estimating the quality of published research is important for evaluations of departments, researchers, and job candidates. Citation-based indicators sometimes support these tasks, but do not work for new articles and have low or moderate accuracy. Previous research has shown that ChatGPT can estimate the quality of research articles, with its scores correlating positively with a proxy for expert scores in all fields, and often more strongly than citation-based indicators, except for clinical medicine. ChatGPT scores may therefore replace citation-based indicators for some applications. This article investigates the clinical medicine anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK's Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226. ChatGPT 4o and 3.5 turbo also gave positive correlations. At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, their mean ChatGPT score correlated strongly with their REF score (r=0.495) but negatively with their citation rate (r=-0.148). Journal and departmental anomalies in these results point to ChatGPT being ineffective at assessing the quality of research in prestigious medical journals or research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT's ability to assess research quality overall for Clinical Medicine, where it might replace citation-based indicators for new research.
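The aggregation and correlation analysis described above can be sketched with pandas and SciPy. This is a hedged approximation under stated assumptions: the input file name, the column names (department, journal, chatgpt_score, ref_dept_mean, citations), and the use of simple means as the departmental and journal REF proxies are all hypothetical choices for illustration, not the study's actual pipeline.

```python
# Sketch of the article-, departmental-, and journal-level correlations reported above.
# Assumption: a hypothetical table with one row per article and the columns named below;
# the study's exact aggregation rules may differ.
import pandas as pd
from scipy.stats import pearsonr

articles = pd.read_csv("uoa1_articles.csv")  # hypothetical input file

# Article level: ChatGPT score vs. departmental mean REF score (cf. r = 0.134).
r_article, _ = pearsonr(articles["chatgpt_score"], articles["ref_dept_mean"])

# Departmental level: mean ChatGPT score vs. mean REF score (cf. r = 0.395, n = 31).
dept = articles.groupby("department").agg(
    chatgpt_mean=("chatgpt_score", "mean"),
    ref_mean=("ref_dept_mean", "mean"),
)
r_dept, _ = pearsonr(dept["chatgpt_mean"], dept["ref_mean"])

# Journal level, restricted to the 100 journals with the most UoA 1 articles:
# mean ChatGPT score vs. REF score (cf. r = 0.495) and vs. citation rate (cf. r = -0.148).
top_journals = articles["journal"].value_counts().nlargest(100).index
journal = (
    articles[articles["journal"].isin(top_journals)]
    .groupby("journal")
    .agg(
        chatgpt_mean=("chatgpt_score", "mean"),
        ref_mean=("ref_dept_mean", "mean"),
        citation_rate=("citations", "mean"),
    )
)
r_journal_ref, _ = pearsonr(journal["chatgpt_mean"], journal["ref_mean"])
r_journal_cit, _ = pearsonr(journal["chatgpt_mean"], journal["citation_rate"])

print(
    f"article r={r_article:.3f}, department r={r_dept:.3f}, "
    f"journal-REF r={r_journal_ref:.3f}, journal-citations r={r_journal_cit:.3f}"
)
```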