🤖 AI Summary
This study addresses the lack of standardized evaluation protocols in machine-generated text detection, which hinders fair comparison of detectors. The authors evaluate 15 detection models from six distinct systems, along with seven trained models, on seven English-language test sets and three creative human-written datasets. Comparing results across datasets and evaluation metrics, they find that no single detector outperforms the others in every scenario, although nearly all are effective for certain tasks. Model rankings vary widely with the dataset and metric used, and overall performance is poor on novel human-written texts from high-risk domains. The work shows that reported detector efficacy depends heavily on the training data, the evaluation data, and the choice of metric, and that methodological decisions that are often assumed or overlooked shape the empirical conclusions.
📝 Abstract
With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems, as well as seven trained models, across seven English-language textual test sets and three creative human-written datasets. We provide an empirical analysis of model performance, the influence of training and evaluation data, and the impact of key metrics. We find that no single system excels in all areas and nearly all are effective for certain tasks, and the representation of model performance is critically linked to dataset and metric choices. We find high variance in model ranks based on datasets and metrics, and overall poor performance on novel human-written texts in high-risk domains. Across datasets and metrics, we find that methodological choices that are often assumed or overlooked are essential for clearly and accurately reflecting model performance.
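The sensitivity of model rankings to metric choice can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the detector names, the scores, and the 0.5 decision threshold are invented toy values, not data or methods from the paper. It ranks two hypothetical detectors once by AUROC (threshold-free) and once by F1 at a fixed threshold.

```python
# Toy illustration only: labels, scores, and detector names are invented,
# not taken from the paper. Label 1 = machine-generated, 0 = human-written.

def auroc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(labels, scores, threshold=0.5):
    """F1 for the positive class when scores >= threshold are flagged."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rank_by(metric, labels, detector_scores):
    """Detector names sorted from best to worst under the given metric."""
    return sorted(
        detector_scores,
        key=lambda name: metric(labels, detector_scores[name]),
        reverse=True,
    )

labels = [0, 0, 0, 1, 1, 1]
detector_scores = {
    # Separates the classes perfectly, but every score is below 0.5.
    "detector_a": [0.10, 0.20, 0.15, 0.30, 0.40, 0.35],
    # Less clean separation, but scores straddle the 0.5 threshold.
    "detector_b": [0.20, 0.60, 0.40, 0.70, 0.55, 0.90],
}

print(rank_by(auroc, labels, detector_scores))
print(rank_by(f1_at_threshold, labels, detector_scores))
```

In this toy setup the two metrics produce opposite rankings: the hypothetical `detector_a` orders the texts perfectly but flags nothing at the fixed threshold, while `detector_b` orders them less cleanly but works at that threshold. Which one looks "better" therefore depends entirely on the metric reported.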