🤖 AI Summary
Existing methods struggle to detect AI-generated text under distribution shift, such as transfer across domains, different generative models, or editing attacks. This work reframes detection as a direction-probing task in the representation space of a language model. It extracts steering vectors from the hidden layers of a frozen language model to construct, at each layer, a direction that separates human-written from machine-generated text. A lightweight classifier is then trained on projection features that measure how each input's layer-wise representations align with these directions. The approach requires no fine-tuning of the language model and captures stylistic signal beyond surface-level statistical patterns. Empirical results show that it consistently outperforms current detection methods both in-distribution and across various out-of-distribution scenarios.
📝 Abstract
Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
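The pipeline described in the abstract (per-layer directions, projection features, lightweight classifier) can be illustrated with a minimal sketch. Several choices here are assumptions, not details from the paper: each direction is taken as the normalized difference of class means, the feature is a plain dot product with that direction, the classifier is a small logistic regression, and the hidden states are synthetic Gaussian arrays standing in for pooled activations of a frozen language model. All function names are hypothetical.

```python
import numpy as np

def layer_directions(h_human, h_machine):
    """Assumed construction: per-layer difference of class means, unit-normalized.
    Inputs have shape (n_samples, n_layers, d); output has shape (n_layers, d)."""
    diff = h_machine.mean(axis=0) - h_human.mean(axis=0)
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, 1e-12)

def projection_features(h, directions):
    """Project each layer's representation onto that layer's direction.
    Returns one scalar per layer: shape (n_samples, n_layers)."""
    return np.einsum("nld,ld->nl", h, directions)

def fit_logistic(X, y, lr=0.5, steps=500):
    """Tiny logistic regression trained by gradient descent on standardized features.
    Returns a function mapping features to P(machine-generated)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sigma
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        grad = p - y
        w -= lr * (Xs.T @ grad) / len(y)
        b -= lr * grad.mean()
    return lambda Xn: 1.0 / (1.0 + np.exp(-(((Xn - mu) / sigma) @ w + b)))

# Synthetic stand-in for hidden states (assumption: real inputs would be
# pooled per-layer activations from a frozen language model).
rng = np.random.default_rng(0)
n, n_layers, d = 400, 4, 16
true_dirs = rng.normal(size=(n_layers, d))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)
human = rng.normal(size=(n, n_layers, d))
machine = rng.normal(size=(n, n_layers, d)) + 2.0 * true_dirs

# Fit directions and classifier on the first half, evaluate on the second.
half = n // 2
dirs = layer_directions(human[:half], machine[:half])
X_train = np.concatenate([projection_features(human[:half], dirs),
                          projection_features(machine[:half], dirs)])
y_train = np.concatenate([np.zeros(half), np.ones(half)])
score = fit_logistic(X_train, y_train)

X_test = np.concatenate([projection_features(human[half:], dirs),
                         projection_features(machine[half:], dirs)])
y_test = np.concatenate([np.zeros(n - half), np.ones(n - half)])
acc = float(((score(X_test) > 0.5) == y_test).mean())
print(f"held-out accuracy on synthetic data: {acc:.3f}")
```

Because the language model stays frozen in this setup, the only trained parameters are one weight per layer plus a bias, which is what makes the classifier lightweight. Other alignment measures, such as cosine similarity, would slot into `projection_features` without changing the rest of the sketch.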