GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge

📅 2024-12-24
🤖 AI Summary
This paper overviews the first edition of the Academic Essay Authenticity Challenge, a shared task on distinguishing human-written from AI-generated academic essays, organized under the GenAI Content Detection shared tasks at COLING 2025. The challenge covers two languages, English and Arabic, and provides an annotated bilingual dataset of academic essays. During evaluation, 25 teams submitted systems for English and 21 for Arabic; most submissions fine-tuned transformer-based models, with one team employing Large Language Models such as Llama 2 and Llama 3. The paper's key contributions are: (1) releasing a bilingual annotated dataset for academic-text authenticity; (2) formulating a standardized task definition and evaluation framework benchmarked against an n-gram baseline; and (3) summarizing participants' approaches, among which the best-performing systems achieved F1 scores exceeding 0.98 on both English and Arabic test sets, substantially outperforming the baseline.

📝 Abstract
This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks co-located with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.
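The abstract's n-gram baseline and F1 metric can be sketched minimally as follows. This is an illustrative reconstruction, not the organizers' released code: a tiny Laplace-smoothed naive Bayes classifier over character n-grams, plus a per-class F1 computation. The class `NgramNB`, the toy labels `"machine"`/`"human"`, and all example texts are assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-grams (illustrative baseline)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-gram frequencies
        self.totals = {}   # label -> total n-gram count for that label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}

    def predict(self, text):
        best_label, best_score = None, -math.inf
        v = len(self.vocab)
        for label, c in self.counts.items():
            # Sum of Laplace-smoothed log-probabilities under this class
            score = sum(
                math.log((c[g] + 1) / (self.totals[label] + v))
                for g in char_ngrams(text, self.n)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f1(gold, pred, positive):
    """F1 score for one class, given parallel gold/predicted label lists."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

On the actual challenge data each essay would be a full text and the baseline would use far richer n-gram features; the point here is only the shape of the pipeline that the fine-tuned transformer systems are compared against.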
Problem

Research questions and friction points this paper is trying to address.

AI-generated content
human writing
text classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Academic Essay Authenticity Challenge
Transformer-based Models
Content Detection Technology
Shammur Absar Chowdhury
Qatar Computing Research Institute
Conversational AI, Representation Learning, Deep Learning, Speech Processing, NLP
Hind Almerekhi
Qatar Computing Research Institute, HBKU, Qatar
Mucahid Kutlu
Assistant Professor, Qatar University
Information Retrieval, Natural Language Processing
Kaan Efe Keles
TOBB ETU, Türkiye
Fatema Ahmad
Qatar Computing Research Institute, HBKU, Qatar
Tasnim Mohiuddin
Scientist, QCRI, HBKU
Machine Learning, Natural Language Processing
George Mikros
Hamad Bin Khalifa University, Qatar
Firoj Alam
Qatar Computing Research Institute, HBKU, Qatar