Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MGC detectors over-rely on superficial features and struggle to detect deeply paraphrased long-form text. To address this, we propose a dual-model framework integrating human writing style modeling with explicit discourse structure analysis. Our contributions include: (1) constructing the first long-text-oriented paraphrased LFQA and WP benchmark datasets; (2) designing a differential scoring mechanism and a PDTB-enhanced document-level encoding paradigm; and (3) jointly leveraging MhBART (style-aware) and DTransformer (discourse-structure-aware), augmented by GPT/DIPPER for high-quality synthetic training data generation. Evaluated on paraLFQA, paraWP, and M4 benchmarks, our method achieves absolute accuracy gains of 15.5%, 4.0%, and 1.5%, respectively—outperforming all state-of-the-art approaches. It effectively captures deceptive syntactic patterns and cross-sentence structural anomalies, demonstrating robustness against sophisticated paraphrasing.

📝 Abstract
The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and for texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets by extending their original versions with GPT and DIPPER, a discourse paraphrasing tool. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism. This model outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It yields substantial performance gains across all datasets: 15.5% absolute improvement on paraLFQA, 4% on paraWP, and 1.5% on M4 compared to SOTA approaches.
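The "difference score mechanism" mentioned in the abstract can be illustrated with a toy sketch: a model trained to mimic human writing style reconstructs an input, and the distance between input and reconstruction serves as the detection signal. The reconstruction step is stubbed out here, the score is a simple normalized token set difference, and the threshold is arbitrary; none of these details are specified by the paper, so treat everything below as an illustrative assumption rather than the authors' method.

```python
def difference_score(original: str, reconstruction: str) -> float:
    """Fraction of vocabulary that differs between the two texts (0 = identical).

    Stand-in for the paper's difference score: MhBART would instead compare the
    input against its reconstruction by a human-style encoder-decoder.
    """
    a = set(original.lower().split())
    b = set(reconstruction.lower().split())
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def classify(original: str, reconstruction: str, threshold: float = 0.5) -> str:
    # Intuition: a human-style model reconstructs human text faithfully
    # (low score) but drifts on machine-generated text (high score).
    # The 0.5 threshold is a placeholder, not a value from the paper.
    return "machine" if difference_score(original, reconstruction) > threshold else "human"
```

In the real system the reconstruction would come from the MhBART decoder, so the score reflects stylistic divergence rather than raw token overlap.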
Problem

Research questions and friction points this paper is trying to address.

Detecting machine-generated content in long documents
Improving detection of paraphrased machine-generated texts
Incorporating discourse structure for enhanced content analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates discourse analysis for structural feature encoding
Uses PDTB preprocessing to capture document-level text structure
Develops paraphrased datasets using GPT and DIPPER tools
Yupei Li
GLAM team, Imperial College London, London, UK
M. Milling
CHI – Chair of Health Informatics, Technical University of Munich, Munich, Germany
Lucia Specia
Professor of Natural Language Processing, Imperial College London & Senior Director at Epic Games
Björn W. Schuller
CHI – Chair of Health Informatics, Technical University of Munich, Munich, Germany