A Comprehensive Dataset for Human vs. AI Generated Text Detection

📅 2025-10-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Detecting AI-generated text and performing model attribution remain challenging due to the lack of comprehensive, realistic, and multi-source benchmark datasets. Method: This paper introduces the first integrated benchmark dataset (58,000+ samples) combining high-quality real-world news articles from *The New York Times* with synthetic texts generated by six state-of-the-art LLMs—Gemma-2-9B, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o—supporting both binary human/AI classification and fine-grained model attribution. Contribution/Results: It is the first work to systematically unify authentic journalistic content with diverse, high-fidelity synthetic outputs, enabling reproducible, multi-dimensional evaluation. Empirical results show that current top-performing methods achieve only 58.35% accuracy on human/AI detection and 8.92% on model attribution, highlighting significant room for improvement. The dataset and evaluation framework are publicly released to advance research in AI-generated text detection and provenance analysis.

Technology Category

Application Category

📝 Abstract

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.

Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated text versus human-written content

Attributing synthetic texts to specific large language models

Addressing content authenticity concerns with comprehensive datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale dataset with human and AI texts

Combined real articles with multiple LLM-generated versions

Established baselines for detection and model attribution

🔎 Similar Papers

No similar papers found.

Authors to Follow