Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the problem of distinguishing human-written fiction excerpts from creative text generated by large language models (LLMs). The authors use an interpretable linear classifier over unigram features, which reaches an accuracy of 0.98 on a previously unseen test set, whereas human evaluators perform near chance on the same task. Their analysis identifies systematic patterns in LLM-generated text: the LLM draws on a larger variety of synonyms, which skews unigram probability distributions in a way that a classifier detects easily but a human reader does not. Four further categories of indicative features are temporal drift, Americanisms, foreign language usage, and colloquialisms. Because detection rests on a constellation of such features rather than on any single cue, the classification appears robust and not easy to circumvent. The approach thus offers an accurate and transparent way to identify machine-generated creative text.

📝 Abstract
We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.
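
The abstract does not say which linear model, tokenizer, or training procedure the authors used, so the following is only a minimal sketch of the general idea: a linear classifier over unigram counts whose learned weights can be read directly as explanations. Logistic regression trained by plain gradient descent is an assumption here, the whitespace tokenizer is deliberately crude, and the six toy sentences are invented for illustration (they are not from the paper's data). All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Invented toy examples (NOT from the paper): label 0 = human, 1 = LLM.
HUMAN = [
    "she walked to the old harbour at dusk",
    "the rain fell on the grey town all night",
    "he said nothing and lit the lamp",
]
LLM = [
    "she strolled to the ancient harbor at twilight",
    "the rain cascaded on the somber town all night",
    "he murmured nothing and ignited the lantern",
]
DOCS = HUMAN + LLM
LABELS = [0] * len(HUMAN) + [1] * len(LLM)


def unigram_counts(text):
    """Lowercased, whitespace-split unigram counts."""
    return Counter(text.lower().split())


def score(counts, weights, bias):
    """Linear score: bias plus the sum of weight * count over unigrams."""
    return bias + sum(weights.get(tok, 0.0) * c for tok, c in counts.items())


def train_logreg(docs, labels, epochs=300, lr=0.1):
    """Fit logistic regression on unigram counts with per-example gradient steps."""
    feats = [unigram_counts(d) for d in docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = max(-30.0, min(30.0, score(x, weights, bias)))  # avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            g = p - y                       # gradient of the log loss w.r.t. z
            bias -= lr * g
            for tok, c in x.items():
                weights[tok] = weights.get(tok, 0.0) - lr * g * c
    return weights, bias


def predict(text, weights, bias):
    """Return 1 (LLM) if the linear score is positive, else 0 (human)."""
    return 1 if score(unigram_counts(text), weights, bias) > 0 else 0


def top_features(weights, k=5):
    """Unigrams with the largest positive weights, i.e. pushing towards label 1."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


weights, bias = train_logreg(DOCS, LABELS)
for tok, w in top_features(weights):
    print(f"{tok:10s} {w:+.3f}")
```

Because the model is linear, each unigram carries exactly one weight, so sorting the weights shows which tokens push a sample towards the LLM label. This is the kind of inspection that makes such a classifier interpretable, although the paper's actual feature analysis may differ in its details.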
Problem

Research questions and friction points this paper is trying to address.

interpretable text classification
LLM-generated text
human-written fiction
text detection
creative writing
Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable classification
LLM detection
unigram features
creative writing
AI-generated text
🔎 Similar Papers
2024-08-08, Conference on Empirical Methods in Natural Language Processing, Citations: 9
Minerva Suvanto
Chalmers University of Technology, Gothenburg, Sweden
Andrea McGlinchey
Edinburgh Napier University, Edinburgh, UK
Mattias Wahde
Professor of Applied Artificial Intelligence, Chalmers University of Technology
artificial intelligence, interpretable AI, glass-box AI, natural language processing, robotics
Peter J Barclay
Edinburgh Napier University, Edinburgh, UK