StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features

📅 2025-07-16
🤖 AI Summary
This work addresses binary detection of AI-generated text with a non-neural, interpretable stylometric approach. Texts are preprocessed with public spaCy models (tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation), and several thousand features are extracted as frequencies of n-grams of these annotations, forming a modular stylistic representation. A LightGBM gradient-boosted tree classifier is then trained on a corpus of more than 500,000 machine-generated texts, with several parameter options explored to increase its capacity and make use of that training set. The key contribution is replacing opaque deep models with explainable frequency-based features and a computationally inexpensive tree-based classifier, following an approach the authors describe as effective in earlier work.

📝 Abstract
This submission to the binary AI detection task is based on a modular stylometric pipeline. Public spaCy models are used for text preprocessing (tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features (frequencies of n-grams of these linguistic annotations); light gradient-boosting machines (LightGBM) are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.
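
To make the feature definition concrete, the following is a minimal sketch of how the relative frequencies of n-grams over one annotation layer could be computed. It is not the authors' implementation: the function name `ngram_frequencies` and the sample tag sequence are hypothetical, and the tags are assumed to have been produced beforehand, for instance from spaCy's `token.pos_` attribute.

```python
from collections import Counter


def ngram_frequencies(tags, n):
    """Return the relative frequency of each n-gram in a sequence of labels.

    `tags` is a list of annotation labels (e.g. part-of-speech tags) for one
    text; the result maps each n-gram (as a tuple) to its share of all
    n-grams in that text.
    """
    ngrams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    total = len(ngrams)
    if total == 0:
        # Text shorter than n tokens: no n-grams, hence no features.
        return {}
    counts = Counter(ngrams)
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical POS sequence standing in for [token.pos_ for token in doc].
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
bigram_freqs = ngram_frequencies(pos_tags, 2)
```

Using relative rather than raw counts keeps the features comparable across texts of different lengths, which is one plausible reason to prefer frequencies in a stylometric setting.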
Problem

Research questions and friction points this paper is trying to address.

Detect AI-generated texts using stylometric features
Train the classifier on more than 500,000 machine-generated texts
Optimize parameters for improved classification performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular stylometric pipeline with spaCy preprocessing
Light gradient-boosting machines (LightGBM) as classifier
Large corpus training with parameter optimization
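
One way to picture the modular design is to treat each annotation layer as an independent feature module whose outputs are merged into a single table. The sketch below illustrates that idea under stated assumptions: the helper names (`layer_features`, `document_features`, `to_matrix`), the feature-naming scheme, and the toy documents are all hypothetical, and the per-layer label sequences are assumed to come from a preprocessing step such as spaCy.

```python
from collections import Counter


def layer_features(layer_name, labels, n):
    """Relative n-gram frequencies for one annotation layer, with prefixed names."""
    grams = [" ".join(labels[i:i + n]) for i in range(len(labels) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    # If there are no n-grams, `counts` is empty and no division takes place.
    return {f"{layer_name}_{n}:{gram}": c / total for gram, c in counts.items()}


def document_features(layers, max_n=2):
    """Merge the features of every layer and every n-gram order up to max_n."""
    features = {}
    for name, labels in layers.items():
        for n in range(1, max_n + 1):
            features.update(layer_features(name, labels, n))
    return features


def to_matrix(docs_features):
    """Align per-document feature dicts to shared columns, filling gaps with 0.0."""
    columns = sorted({name for feats in docs_features for name in feats})
    rows = [[feats.get(col, 0.0) for col in columns] for feats in docs_features]
    return columns, rows


# Toy documents: each maps a layer name to its label sequence (hypothetical data).
docs = [
    {"pos": ["DET", "NOUN", "VERB"], "dep": ["det", "nsubj", "ROOT"]},
    {"pos": ["PRON", "VERB"], "dep": ["nsubj", "ROOT"]},
]
columns, matrix = to_matrix([document_features(d) for d in docs])
```

Because every column carries its layer and n-gram order in its name, a layer can be added or dropped without touching the others, and the resulting matrix could then be passed to a gradient-boosted tree classifier such as `lightgbm.LGBMClassifier`.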
Jeremi K. Ochab
Institute of Theoretical Physics, Jagiellonian University, Kraków
complex networks, data analysis for neurosciences, machine learning, natural language processing, computational stylistics
Mateusz Matias
Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Kraków, Poland
Tymoteusz Boba
Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Kraków, Poland
Tomasz Walkowiak
Politechnika Wrocławska
NLP