Interpretable Text Classification Applied to the Detection of LLM-generated Creative Writing

📅 2026-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the problem of distinguishing human-authored fiction excerpts from creative text generated by large language models (LLMs). The authors use an interpretable linear classifier based on unigram (single-token) features, which achieves 98% accuracy on unseen test data, whereas human evaluators perform near chance. Their analysis identifies systematic, detectable patterns in LLM-generated text: most importantly, a tendency to use a larger variety of synonyms, which skews token probability distributions, along with four further categories (temporal drift, Americanisms, foreign language usage, and colloquialisms). Because detection depends on a constellation of such features, the classification appears robust and is not easy to circumvent. The approach thus offers an accurate and transparent way to identify machine-generated creative content.

📝 Abstract

We consider the problem of distinguishing human-written creative fiction (excerpts from novels) from similar text generated by an LLM. Our results show that, while human observers perform poorly (near chance levels) on this binary classification task, a variety of machine-learning models achieve accuracy in the range 0.93 - 0.98 over a previously unseen test set, even using only short samples and single-token (unigram) features. We therefore employ an inherently interpretable (linear) classifier (with a test accuracy of 0.98), in order to elucidate the underlying reasons for this high accuracy. In our analysis, we identify specific unigram features indicative of LLM-generated text, one of the most important being that the LLM tends to use a larger variety of synonyms, thereby skewing the probability distributions in a manner that is easy to detect for a machine learning classifier, yet very difficult for a human observer. Four additional explanation categories were also identified, namely, temporal drift, Americanisms, foreign language usage, and colloquialisms. As identification of the AI-generated text depends on a constellation of such features, the classification appears robust, and therefore not easy to circumvent by malicious actors intent on misrepresenting AI-generated text as human work.
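To make the idea of an interpretable linear classifier over unigram features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the logistic-regression trainer, the function names (`tokenize`, `train_unigram_logreg`, `predict`, `top_tokens`), the hyperparameters, and the four toy sentences are all assumptions made for illustration. The point it shows is that each token receives a single weight, so the most positive weights can be read directly as the tokens that push a sample toward the "LLM" label.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_unigram_logreg(texts, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression model on unigram counts with plain SGD.

    Labels: 0 = human-written, 1 = LLM-generated (an assumed convention).
    Returns one weight per token plus a bias term.
    """
    feats = [Counter(tokenize(t)) for t in texts]
    weights = {tok: 0.0 for f in feats for tok in f}
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = bias + sum(weights[t] * c for t, c in x.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(LLM)
            g = p - y                        # gradient of the log loss w.r.t. z
            bias -= lr * g
            for t, c in x.items():
                weights[t] -= lr * g * c
    return weights, bias


def predict(text, weights, bias):
    # Tokens never seen in training contribute nothing to the score.
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(t, 0.0) * c for t, c in counts.items())
    return 1 if z > 0 else 0


def top_tokens(weights, k=5):
    # The most positive weights are the strongest evidence for the LLM class.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy data invented for this sketch; it is NOT taken from the paper.
texts = [
    "she walked to the grey harbour at dusk",
    "she strolled to the gray harbor at twilight",
    "he said nothing and looked at the colour of the sea",
    "he murmured nothing and gazed at the color of the ocean",
]
labels = [0, 1, 0, 1]

weights, bias = train_unigram_logreg(texts, labels)
predictions = [predict(t, weights, bias) for t in texts]
print(predictions)
print(top_tokens(weights))
```

In this toy setup, tokens that occur only in the samples labeled LLM (such as the American spelling "gray") end up with positive weights, and tokens that occur only in the samples labeled human (such as "grey") end up with negative weights. Inspecting weights this way is the kind of analysis a linear model permits; the real study would use far larger corpora and a properly held-out test set.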

Problem

Research questions and friction points this paper is trying to address.

interpretable text classification

LLM-generated text

human-written fiction

text detection

creative writing

Innovation

Methods, ideas, or system contributions that make the work stand out.

interpretable classification

LLM detection

unigram features