🤖 AI Summary
This work addresses a fundamental limitation of subjective rating datasets: their inherent noise constrains the correlation any computational model can achieve with human judgments. To date, there has been no practical method to quantify the upper bound of this correlation. The paper introduces ρ-Perfect, the first approach capable of estimating this theoretical ceiling in real-world settings. By modeling heteroscedastic noise in subjective ratings, the method derives an analytical upper bound on model-human correlation and validates it by showing that its square approximates test-retest reliability. Demonstrated on speech quality assessment benchmarks, ρ-Perfect disentangles whether performance plateaus stem from model deficiencies or from intrinsic limits of data quality, offering a new tool for evaluating and advancing models of subjective perception.
📝 Abstract
Subjective ratings contain inherent noise that limits model-human correlation, but this reliability issue is rarely quantified. In this paper, we present $\rho$-Perfect, a practical estimate of the highest correlation a model can achieve on subjectively rated datasets. We define $\rho$-Perfect as the correlation between a perfect predictor and human ratings, and derive an estimate of this value under heteroscedastic noise, a common occurrence in subjectively rated datasets. We show that $\rho$-Perfect squared estimates test-retest correlation and use this relation to validate the estimate. We demonstrate the use of $\rho$-Perfect on a speech quality dataset and show how the measure can distinguish between model limitations and data quality issues.
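The core relation claimed in the abstract can be checked with a small simulation. The sketch below is illustrative only and not the paper's method: it assumes ratings are latent "true" scores plus zero-mean Gaussian noise with a different variance per item (heteroscedastic noise), takes the perfect predictor to be the latent scores themselves, and verifies numerically that the squared correlation between the perfect predictor and one round of noisy ratings approximates the test-retest correlation between two independent rating rounds. All variable names and distribution parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent "true" quality scores (e.g., on a MOS-like scale).
s = rng.normal(3.0, 0.8, n)

# Heteroscedastic noise: each item gets its own rating-noise std.
sigma = rng.uniform(0.2, 1.2, n)

# Two independent rating rounds of the same items (rate and re-rate).
y1 = s + rng.normal(0.0, sigma)
y2 = s + rng.normal(0.0, sigma)

# Correlation of a perfect predictor (the true scores) with noisy ratings:
# this plays the role of the achievable ceiling.
rho_perfect = np.corrcoef(s, y1)[0, 1]

# Test-retest correlation between the two independent rating rounds.
test_retest = np.corrcoef(y1, y2)[0, 1]

print(f"rho_perfect          = {rho_perfect:.3f}")
print(f"rho_perfect squared  = {rho_perfect**2:.3f}")
print(f"test-retest          = {test_retest:.3f}")
```

Under this noise model, `corr(s, y1)^2 ≈ Var(s) / (Var(s) + E[sigma^2]) ≈ corr(y1, y2)`, so the squared ceiling and the test-retest correlation agree up to sampling error, mirroring the validation strategy described in the abstract.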