Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation

📅 2025-08-05
🤖 AI Summary
This study investigates the optimal trade-off between sample size (N) and annotations per sample (K) under a fixed annotation budget (N×K) to enhance the reliability and reproducibility of machine learning evaluation. Addressing pervasive inter-annotator disagreement, we systematically simulate statistical stability in model comparison across diverse (N,K) configurations using real-world multi-annotator classification datasets and empirically fitted annotation distributions. Key findings are: (1) K > 10 is often optimal for effectively suppressing annotation noise; (2) distribution-sensitive metrics—e.g., KL divergence and calibration error—benefit substantially from higher K, whereas point-estimate metrics like accuracy depend more critically on N; (3) when N×K ≤ 1000, modest increases in K yield substantial gains in evaluation credibility. This work provides the first quantitative characterization of the asymmetric impact of annotation strategy on evaluation robustness, delivering actionable guidelines for evaluation design under resource constraints.
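The budget-splitting simulation described above can be sketched in miniature. This is not the authors' code: the Dirichlet prior over per-item label distributions, the L1 error statistic, and all parameter values are illustrative assumptions. The idea is to compare different $(N, K)$ splits of a fixed budget by measuring how closely $K$ sampled annotations recover each item's true annotation distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def distribution_error(n_items, k, n_classes=3, n_repeats=100):
    """Mean L1 distance between each item's true annotation distribution
    and the empirical distribution estimated from k sampled annotations."""
    # Per-item "true" annotation distributions, drawn from a Dirichlet prior
    # (standing in for distributions fitted to real multi-annotator data).
    true_dists = rng.dirichlet(np.ones(n_classes), size=n_items)
    errs = []
    for _ in range(n_repeats):
        # Simulate k annotators labeling each item.
        counts = np.stack([rng.multinomial(k, p) for p in true_dists])
        errs.append(np.abs(counts / k - true_dists).sum(axis=1).mean())
    return float(np.mean(errs))

budget = 1000  # fixed annotation budget N * K
for n, k in [(1000, 1), (200, 5), (50, 20)]:
    print(f"N={n:4d} K={k:2d}  mean L1 error: {distribution_error(n, k):.3f}")
```

Under this toy model the per-item estimation error shrinks as $K$ grows regardless of $N$, which is one way to see why distribution-sensitive metrics favor spending budget on more annotations per item rather than more items.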

📝 Abstract
Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluation leads to greater trust, confidence, and value. However, the ground-truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically done. One reason for this gap is that budgets for collecting human-annotated evaluation data are limited, and obtaining responses from multiple annotators for each example greatly increases per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets with multiple annotations per item, along with simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that for every dataset tested, evaluation that accounts for human disagreement can be achieved on at least one metric with $N \times K$ of no more than 1000 (and often much lower). Moreover, this minimal $N \times K$ almost always occurred for $K > 10$. Furthermore, the nature of the trade-off between $K$ and $N$ -- or whether one existed at all -- depends on the evaluation metric, with metrics more sensitive to the full distribution of responses performing better at higher levels of $K$. Our methods can help ML practitioners collect more effective test data by identifying the metric and the number of items and annotations per item that yield the most reliability for their budget.
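The abstract's metric-dependence finding can be illustrated with a toy comparison. This is a hedged sketch, not the paper's experimental code: the perturbed "model," the smoothing constant, and all sizes are assumptions. A point-estimate metric like accuracy against the majority label is an average over items, so its precision is governed mainly by $N$, whereas a KL-based score computed against the empirical annotation distribution carries a per-item bias that shrinks only as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def evaluate(n_items, k, n_classes=3):
    """Score a toy 'model' against k human annotations per item, returning
    (accuracy vs. majority label, mean KL(empirical || model))."""
    # Per-item human label distributions and a model that is close to them.
    true = rng.dirichlet(np.ones(n_classes), size=n_items)
    model = np.clip(true + rng.normal(0.0, 0.05, true.shape), 1e-6, None)
    model /= model.sum(axis=1, keepdims=True)
    # k sampled annotations per item, and a smoothed empirical distribution.
    counts = np.stack([rng.multinomial(k, p) for p in true])
    emp = (counts + 1e-6) / (k + n_classes * 1e-6)
    acc = float((model.argmax(axis=1) == counts.argmax(axis=1)).mean())
    kl = float((emp * np.log(emp / model)).sum(axis=1).mean())
    return acc, kl

for k in (1, 5, 50):
    acc, kl = evaluate(500, k)
    print(f"K={k:2d}  accuracy={acc:.2f}  mean KL={kl:.3f}")
```

At $K = 1$ the empirical distribution is one-hot, so the KL score is dominated by annotation noise; at larger $K$ it approaches the true divergence between the human distribution and the model, mirroring the paper's observation that distribution-sensitive metrics benefit disproportionately from higher $K$.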
Problem

Research questions and friction points this paper is trying to address.

Investigates trade-off between items (N) and annotations per item (K) for reliable ML evaluation.
Determines optimal (N, K) configuration under fixed budget for reproducible ML assessments.
Analyzes impact of human disagreement on evaluation metrics and data collection strategies.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal (N, K) trade-off for ML evaluation
Human disagreement impact on evaluation metrics
Budget-efficient annotation strategy for reliability