🤖 AI Summary
Existing large language model (LLM) text detectors exhibit insufficient robustness across diverse decoding strategies, undermining their practical reliability. Method: This work systematically investigates how sampling-based decoding—specifically temperature scaling, top-p (nucleus) sampling, and related variants—affects text detectability, focusing on the degradation mechanisms induced by (sub)word-level distributional perturbations. We construct a large-scale benchmark comprising 37 distinct decoding configurations and comprehensively evaluate state-of-the-art detectors using AUROC as the primary metric. Contribution/Results: We find that minor adjustments to decoding parameters can reduce AUROC from near 100% to as low as 1%, exposing detectors’ extreme sensitivity to generation settings. The study identifies fundamental flaws in current evaluation paradigms and advocates for decoder-agnostic, robust detection frameworks alongside standardized evaluation protocols for decoding diversity. To foster reproducibility and community advancement, we publicly release all data and code.
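AUROC, the primary metric here, is the probability that a detector scores a randomly chosen machine-generated text higher than a randomly chosen human-written one, so a value near 1% means the detector's ranking is almost perfectly inverted, far worse than the 50% of random guessing. A minimal pairwise-comparison sketch (the detector scores below are hypothetical, for illustration only):

```python
def auroc(human_scores, machine_scores):
    # Probability that a random machine-generated text receives a higher
    # detector score than a random human-written one (ties count as 0.5).
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))

# Hypothetical detector scores (higher = "more likely machine-generated"):
human        = [0.10, 0.20, 0.30]
machine_easy = [0.80, 0.90, 0.95]  # cleanly separated -> AUROC = 1.0
machine_hard = [0.01, 0.02, 0.05]  # a decoding shift pushes scores
                                   # below the human range -> AUROC = 0.0
```

An AUROC collapse from ~100% to ~1% thus indicates that, under the new decoding configuration, machine text systematically scores *lower* than human text on the detector's scale.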
📝 Abstract
As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters, such as temperature or top-p (nucleus) sampling, can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework at https://github.com/BaggerOfWords/Sampling-and-Detection.
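The decoding parameters discussed above reshape the model's next-token distribution before sampling: temperature rescales the logits (lower values sharpen the distribution, higher values flatten it), while top-p (nucleus) sampling truncates it to the smallest set of tokens whose cumulative probability reaches p. A minimal sketch over a toy four-token distribution (the logit values are illustrative, not from the paper):

```python
import math

def apply_temperature(logits, temperature):
    # Divide logits by T, then softmax: T < 1 sharpens the distribution,
    # T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p):
    # Nucleus (top-p) sampling: keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches p, zero out the rest, renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

logits = [2.0, 1.0, 0.5, -1.0]           # toy next-token logits
sharp = apply_temperature(logits, 0.7)   # more mass on the top token
flat = apply_temperature(logits, 1.5)    # more mass on the tail
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)  # drops the rarest token
```

Even these small perturbations change which (sub)words appear in the tail of generated text, which is exactly the signal many detectors rely on.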