🤖 AI Summary
This work addresses two limitations of existing style-based detectors of AI-generated text: they require authorship labels for training, and they are limited to few-shot inference, which needs in-distribution samples that may not be available. The authors propose a style representation learning framework that needs no authorship labels: a style encoder is trained to reconstruct human-authored text from its machine-generated paraphrase while a semantic encoder is kept frozen, which biases the style encoder toward the non-semantic features needed for reconstruction. The learned representations support both a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, the method matches or outperforms all baselines in the few-shot setting; in the zero-shot setting it is competitive with fully supervised classifiers on in-distribution test data and generalizes better to unseen LLMs. The representations also transfer to authorship verification and fine-grained style discrimination, tasks they were never trained on.
📝 Abstract
The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.
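To make the zero-shot idea concrete, here is a minimal sketch of one-class scoring over fixed style embeddings. It is not the authors' implementation. Deep SVDD trains a network so that embeddings of normal data fall close to a center point; this sketch skips the network training and only keeps the center-distance scoring rule. It also assumes that the one-class model is fit on human-written text only, which the abstract does not state. The class name `OneClassStyleDetector`, the `quantile` parameter, and the toy 2-D vectors are all hypothetical; in practice the vectors would come from a trained style encoder.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OneClassStyleDetector:
    """Center-distance anomaly scoring (a simplification of Deep SVDD).

    Assumption: the detector is fit on embeddings of human-written text
    only, and texts far from the center are flagged as machine-generated.
    """

    def __init__(self, quantile: float = 0.95) -> None:
        self.quantile = quantile
        self.center: Optional[List[float]] = None
        self.threshold: Optional[float] = None

    def fit(self, human_embeddings: List[Vector]) -> "OneClassStyleDetector":
        # The center is the mean embedding of the reference texts.
        self.center = mean_vector(human_embeddings)
        scores = sorted(squared_distance(e, self.center) for e in human_embeddings)
        # The threshold is the chosen quantile of the reference scores.
        idx = max(0, min(len(scores) - 1, math.ceil(self.quantile * len(scores)) - 1))
        self.threshold = scores[idx]
        return self

    def score(self, embedding: Vector) -> float:
        # Larger score means further from the human-text center.
        return squared_distance(embedding, self.center)

    def is_machine(self, embedding: Vector) -> bool:
        return self.score(embedding) > self.threshold


# Toy 2-D "style embeddings" standing in for style-encoder outputs.
human = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]
detector = OneClassStyleDetector(quantile=0.95).fit(human)
near = detector.is_machine([0.0, 0.05])
far = detector.is_machine([1.0, 1.0])
```

With these toy inputs, the point near the center is treated as human-written and the distant point is flagged. No machine-generated samples are needed at fit time, which is what makes a detector of this kind usable in a zero-shot setting.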