A Comprehensive Dataset for Human vs. AI Generated Text Detection

📅 2025-10-26
🤖 AI Summary
Problem: Detecting AI-generated text and attributing it to a source model remain challenging because comprehensive, realistic, multi-source benchmark datasets are lacking. Method: This paper introduces the first integrated benchmark dataset (58,000+ samples) combining real-world news articles from *The New York Times* with synthetic texts generated by six state-of-the-art LLMs (Gemma-2-9B, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o), supporting both binary human/AI classification and fine-grained model attribution. Contribution/Results: By unifying authentic journalistic content with outputs from diverse LLMs, the dataset enables reproducible evaluation on both tasks. The paper's baselines reach only 58.35% accuracy on human/AI detection and 8.92% on model attribution, leaving substantial room for improvement. The dataset is publicly released to advance research in AI-generated text detection and provenance analysis.

📝 Abstract
The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs, including Gemma-2-9B, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides the original article abstracts, used as prompts, together with the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/gsingh1-py/train.
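To put the two reported accuracies in context, it can help to compare them with what uniform random guessing would achieve. The sketch below is illustrative only: it assumes balanced classes, a two-way label space for detection, and a six-way label space (one class per listed LLM) for attribution. None of these assumptions is confirmed by the abstract, and the helper names are hypothetical.

```python
# Chance-level accuracy for a uniform random guesser over balanced classes.
# The class counts below are ASSUMPTIONS for illustration, not figures
# taken from the paper.

def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy (percent) of uniform guessing over balanced classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 100.0 / num_classes

# Baseline accuracies as reported in the abstract (percent).
REPORTED = {"human_vs_ai": 58.35, "model_attribution": 8.92}

# Assumed label-space sizes: 2 for human/AI, 6 for the six listed LLMs.
ASSUMED_CLASSES = {"human_vs_ai": 2, "model_attribution": 6}

def margin_over_chance(task: str) -> float:
    """Reported accuracy minus the assumed chance level, in percentage points."""
    return REPORTED[task] - chance_accuracy(ASSUMED_CLASSES[task])

if __name__ == "__main__":
    for task in REPORTED:
        print(
            f"{task}: reported={REPORTED[task]:.2f}%, "
            f"chance={chance_accuracy(ASSUMED_CLASSES[task]):.2f}%, "
            f"margin={margin_over_chance(task):+.2f} pts"
        )
```

Under these assumptions the detection baseline sits a few points above chance, while the attribution figure falls below the six-way chance level. That gap may simply mean the paper's label space or metric differs from what is assumed here, so the comparison should be read as a rough sanity check rather than a finding.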
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated text versus human-written content
Attributing synthetic texts to specific large language models
Addressing content authenticity concerns with comprehensive datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale dataset with human and AI texts
Combined real articles with multiple LLM-generated versions
Established baselines for detection and model attribution
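As a rough illustration of how the two baseline tasks relate, the sketch below scores both from a single list of per-sample source labels. The `"human"` label string, the function names, and the toy labels and predictions are all hypothetical; the actual dataset schema and the paper's evaluation protocol may differ.

```python
# Minimal sketch: scoring human/AI detection and model attribution from
# one list of source labels. Label strings and toy data are hypothetical.

HUMAN = "human"  # assumed label for human-written samples

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def to_binary(sources):
    """Collapse every source label to 'human' or 'ai'."""
    return [HUMAN if s == HUMAN else "ai" for s in sources]

def detection_accuracy(true_sources, pred_sources):
    """Task 1: binary human-vs-AI accuracy over all samples."""
    return accuracy(to_binary(true_sources), to_binary(pred_sources))

def attribution_accuracy(true_sources, pred_sources):
    """Task 2: exact-model accuracy, scored only on truly AI-written samples."""
    pairs = [(t, p) for t, p in zip(true_sources, pred_sources) if t != HUMAN]
    return accuracy([t for t, _ in pairs], [p for _, p in pairs])

# Toy ground truth and toy predictions from a made-up classifier.
true_sources = ["human", "Mistral-7B", "GPT-4-o", "human", "Yi-Large"]
pred_sources = ["human", "GPT-4-o", "GPT-4-o", "Yi-Large", "Yi-Large"]

if __name__ == "__main__":
    print("detection:", detection_accuracy(true_sources, pred_sources))
    print("attribution:", attribution_accuracy(true_sources, pred_sources))
```

Scoring attribution only on samples whose true source is an LLM is one possible design choice; an alternative would treat "human" as an extra class in a single multi-class task.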
Authors

Rajarshi Roy
Kalyani Government Engineering College, India

Nasrin Imanpour
PhD, Computer Science and Engineering
Artificial Intelligence, Machine Learning, Computer Vision

Ashhar Aziz
Indraprastha Institute of Information Technology Delhi, India

Shashwat Bajpai
BITS Pilani Hyderabad Campus, India

Gurpreet Singh
Indian Institute of Information Technology Guwahati, India

Shwetangshu Biswas
National Institute of Technology Silchar, India

Kapil Wanaskar
San Jose State University, USA

Parth Patwa
Amazon
Machine Learning, Deep Learning, Natural Language Processing, Computational Linguistics, Computer

Subhankar Ghosh
Indian Institute of Technology
Computer Vision, Machine Learning, Artificial Intelligence

Shreyas Dixit
Vishwakarma Institute of Information Technology, India

Nilesh Ranjan Pal
Kalyani Government Engineering College, India

Vipula Rawte
AI Institute of University of South Carolina
Text Mining, Natural Language Processing, Deep Learning, Semantic Web, Ontology

Ritvik Garimella
PhD @ UofSC
NeuroSymbolic AI, Multimodal Learning, Deep Learning, NLP

Gaytri Jena
Gandhi Institute for Technological Advancement, India

Amit Sheth
NCR Chair & Prof.; Founding Director, AI Institute; U. of South Carolina
Neurosymbolic AI, Knowledge Graph, Knowledge-infused Learning, Semantic Web, Artificial Intelligence

Vasu Sharma
Facebook AI Research (FAIR)
Generative AI, LLMs, Computer Vision, Natural Language Processing, Multimodal ML

Aishwarya Naresh Reganti
Amazon
Artificial Social Intelligence, Multimodal ML, Graph Neural Networks, Natural Language Processing

Vinija Jain
Meta | Ex: Amazon, Oracle, Palo Alto Networks
AI, Natural Language Processing, Multimodal AI, Recommender Systems, Information Retrieval

Aman Chadha
GenAI Leadership @ Apple • Stanford AI • UW-Madison ECE • Ex: Apple, AWS, Alexa, Nvidia
Multimodal AI, Natural Language Processing, Computer Vision, Speech Processing, Recommender Systems

Amitava Das
Birla Institute of Technology and Science Pilani Goa, India