EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

📅 2026-03-30
🤖 AI Summary
Research on scientific writing revision, and on evaluating large language models (LLMs) as scientific writing assistants, is limited by the lack of publicly available early-stage revision data. This work addresses the gap by extracting revision traces from the drafting stage preserved in arXiv LaTeX source files. Commented-out text is aligned with nearby final text to form paragraph-level candidate revision pairs, and LLM-based filtering retains the genuine revisions. From 1.28 million candidate pairs, the pipeline yields 578,000 validated revision pairs. The authors also provide a human-annotated benchmark for revision detection, supporting research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
📝 Abstract
Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
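The extraction idea in the abstract (pair commented-out LaTeX text with nearby final text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `split_blocks` and `candidate_pairs`, the character-level similarity measure, and all thresholds are assumptions chosen for illustration, and the LLM-based filtering step is not reproduced here.

```python
import re
from difflib import SequenceMatcher

# A line counts as "commented out" only if '%' is its first non-space character.
# Trailing inline comments and escaped '\%' are deliberately ignored (assumption).
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into (kind, text) blocks, kind in {'comment', 'text'}."""
    blocks = []
    kind, buf = None, []
    for line in latex_source.splitlines():
        match = COMMENT_LINE.match(line)
        if match:
            line_kind, content = "comment", match.group(1).strip()
        elif line.strip():
            line_kind, content = "text", line.strip()
        else:
            line_kind, content = None, ""  # blank line ends the current block
        if line_kind != kind and buf:
            blocks.append((kind, " ".join(buf)))
            buf = []
        kind = line_kind
        if content:
            buf.append(content)
    if buf:
        blocks.append((kind, " ".join(buf)))
    return blocks


def candidate_pairs(latex_source, min_sim=0.3, max_sim=0.95, min_words=5):
    """Pair each commented block with its most similar adjacent text block.

    The thresholds are hypothetical: too-low similarity suggests unrelated text,
    too-high similarity suggests a near-verbatim copy rather than a revision.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, draft) in enumerate(blocks):
        if kind != "comment" or len(draft.split()) < min_words:
            continue
        best = None
        for j in (i - 1, i + 1):  # only the immediately adjacent blocks
            if 0 <= j < len(blocks) and blocks[j][0] == "text":
                final = blocks[j][1]
                sim = SequenceMatcher(None, draft, final).ratio()
                if best is None or sim > best[2]:
                    best = (draft, final, sim)
        if best is not None and min_sim <= best[2] <= max_sim:
            pairs.append(best)
    return pairs
```

Calling `candidate_pairs(source)` on the text of a `.tex` file returns `(draft, final, similarity)` tuples. A heuristic like this also picks up commented-out commands and unrelated notes, which is why a downstream filter (LLM-based in the paper) is needed to keep only genuine revisions.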
Problem

Research questions and friction points this paper is trying to address.

scientific writing
revision traces
early-stage revisions
LaTeX
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific writing revisions
LaTeX writing traces
commented-out text extraction
LLM-based filtering
early-stage revision dataset
Léane Jourdan
PhD student in computer science, LS2N, Nantes Université
NLP, Writing assistance, Text revision
Julien Aubert-Béduchaud
Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
Yannis Chupin
Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
Marah Baccari
Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
Florian Boudin
Associate Professor, LS2N - Nantes Université and JFLI - National Institute of Informatics / Tokyo
Natural Language Processing, Information Retrieval, Computational Linguistics