🤖 AI Summary
This paper addresses the problem of quantifying inter-rater agreement in three-level hierarchical data, where multiple raters assess every subject at multiple time points and take repeated measurements at each time point. Within the generalized linear mixed-effects model (GLMM) framework, we propose an extended concordance correlation coefficient (CCC), applicable to discrete or continuous ratings, together with a method for its interval estimation. Instead of the conventional Fisher Z-transformation, the method combines a linearization of the model with an adaptation of fiducial inference to construct confidence intervals for the CCC. Monte Carlo simulations indicate that the resulting intervals have satisfactory coverage probabilities and shorter expected widths than the Fisher Z-transformation interval, even under moderate sample sizes. We illustrate the approach with two clinical applications from the literature: agreement between manual and computer-generated measurements of joint space width on knee radiographs in an osteoarthritis trial, and agreement between a technologist and a neuroradiologist on fiber counts in corticospinal tractography. We also point to further applications of the approach, including in artificial intelligence. The proposed methodology offers a general statistical tool for complex longitudinal, multi-rater studies.
📝 Abstract
A generalization of the classical concordance correlation coefficient (CCC) is considered under a three-level design in which multiple raters rate every subject over time, and each rater rates every subject multiple times at each measurement time point. The ratings can be discrete or continuous. A methodology is developed for the interval estimation of the CCC based on a suitable linearization of the model along with an adaptation of the fiducial inference approach. The resulting confidence intervals have satisfactory coverage probabilities and shorter expected widths than the interval based on the Fisher Z-transformation, even under moderate sample sizes. Two real applications available in the literature are discussed. The first is based on a clinical trial to determine whether various treatments are more effective than a placebo for treating knee pain associated with osteoarthritis. The CCC was used to assess agreement among manual measurements of joint space width on plain radiographs by two raters and computer-generated measurements on digitized radiographs. The second example concerns corticospinal tractography, where the CCC was again applied to evaluate the agreement between a well-trained technologist and a neuroradiologist on measurements of fiber number in both the right and left corticospinal tracts. Other relevant applications of our general approach are highlighted in many areas, including artificial intelligence.
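For orientation, the classical two-rater CCC that the abstract generalizes can be sketched in a few lines. The code below is only an illustration of that single-level baseline and of a Fisher Z-transformation interval. It is not the three-level estimator or the fiducial procedure developed in the paper. The function names `lin_ccc` and `fisher_z_interval` are hypothetical, and the variance on the Z scale is assumed to be Lin's large-sample formula in its corrected form, which the source does not state.

```python
import math
from statistics import NormalDist


def _moments(x, y):
    """Return n, means, and 1/n variances and covariance of paired ratings."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return n, mx, my, sxx, syy, sxy


def lin_ccc(x, y):
    """Classical two-rater CCC: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    _, mx, my, sxx, syy, sxy = _moments(x, y)
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


def fisher_z_interval(x, y, level=0.95):
    """Fisher Z interval for the classical CCC (assumed large-sample variance)."""
    n, mx, my, sxx, syy, sxy = _moments(x, y)
    ccc = 2.0 * sxy / (sxx + syy + (mx - my) ** 2)
    r = sxy / math.sqrt(sxx * syy)           # Pearson correlation
    if r == 0.0 or abs(ccc) >= 1.0:
        raise ValueError("interval undefined when r == 0 or |CCC| == 1")
    u2 = (mx - my) ** 2 / math.sqrt(sxx * syy)   # squared location shift u^2
    # Assumption: Lin's asymptotic variance of atanh(CCC), corrected form.
    var_z = (
        (1 - r**2) * ccc**2 / ((1 - ccc**2) * r**2)
        + 2 * ccc**3 * (1 - ccc) * u2 / (r * (1 - ccc**2) ** 2)
        - ccc**4 * u2**2 / (2 * r**2 * (1 - ccc**2) ** 2)
    ) / (n - 2)
    if var_z <= 0.0:
        raise ValueError("non-positive variance estimate")
    z = math.atanh(ccc)
    half = NormalDist().inv_cdf(0.5 + level / 2.0) * math.sqrt(var_z)
    # Back-transform so the limits stay inside (-1, 1).
    return ccc, math.tanh(z - half), math.tanh(z + half)


if __name__ == "__main__":
    rater_a = [1, 2, 3, 4, 5, 6, 7, 8]
    rater_b = [1.2, 1.9, 3.4, 3.8, 5.5, 6.1, 6.8, 8.3]
    print(fisher_z_interval(rater_a, rater_b))
```

A constant shift between raters shows why the CCC differs from a plain correlation: `lin_ccc([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])` gives 0.8 even though the Pearson correlation of those ratings is 1, because the CCC penalizes the systematic offset.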