You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models

📅 2025-10-04
🤖 AI Summary
Preprint platforms such as arXiv inadvertently expose sensitive information, including personally identifiable information (PII), cloud credentials, and private URLs, by publicly hosting uncurated LaTeX source files, comments, and auxiliary artifacts, posing serious security and reputational risks. To address this, we propose LaTeXpOsEd, a four-stage detection framework, and use it to conduct the first systematic analysis of 100,000 raw arXiv submissions. We introduce LLMSec-DB, the first benchmark dataset for secret detection in academic documents, and combine regex-based pattern matching, logical filtering, and large language models (LLMs) to identify fine-grained sensitive content, including non-cited material and embedded comments. Our evaluation uncovers thousands of high-risk leakage instances involving identities, authentication tokens, and internal communications. This work exposes critical security blind spots in scholarly infrastructure and establishes a reproducible methodology and evaluation foundation for automated security auditing of academic artifacts.
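The regex-based stage of such a pipeline can be sketched as below. The patterns shown are illustrative, well-known token formats (AWS access key IDs, GitHub personal access tokens, Google API keys); the paper's actual rule set is not reproduced here.

```python
import re

# Hypothetical pattern set; real scanners ship many more rules.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in one file's text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits

sample = "% TODO remove before submission: key=AKIAABCDEFGHIJKLMNOP"
print(scan_text(sample))  # -> [('aws_access_key', 'AKIAABCDEFGHIJKLMNOP')]
```

Regex matching alone produces false positives (e.g., example keys in documentation), which is why the framework layers filtering and LLM review on top of it.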

📝 Abstract
The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
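Harvesting LaTeX comments, one of the leak channels the abstract describes, amounts to collecting everything after an unescaped `%` on each source line. A minimal sketch (the paper's own extraction scripts may handle more edge cases, such as `verbatim` environments):

```python
import re

def extract_latex_comments(source: str) -> list[str]:
    """Collect LaTeX comments, skipping escaped percent signs (\\%)."""
    comments = []
    for line in source.splitlines():
        # First % not preceded by a backslash starts a comment.
        m = re.search(r"(?<!\\)%(.*)", line)
        if m:
            comments.append(m.group(1).strip())
    return comments

tex = r"""
\section{Results}  % reviewer 2 will hate this
We report 95\% accuracy.
% TODO: delete the Dropbox link before upload
"""
print(extract_latex_comments(tex))
# -> ['reviewer 2 will hate this', 'TODO: delete the Dropbox link before upload']
```

Note that `95\%` is correctly ignored: the negative lookbehind rejects escaped percent signs, which appear constantly in scientific prose.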
Problem

Research questions and friction points this paper is trying to address.

Analyzing security risks in preprint archives from unsanitized source materials
Detecting sensitive information leaks using large language models and pattern matching
Identifying exposed credentials and confidential communications in scientific submissions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combined pattern matching with logical filtering techniques
Integrated large language models for secret detection
Created benchmark to evaluate 25 state-of-the-art models
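The "logical filtering" combined with pattern matching above can be illustrated with a standard entropy heuristic used by secret scanners: long, high-entropy tokens are likelier to be credentials than ordinary identifiers. The length and entropy thresholds below are assumptions for the sketch, not values from the paper.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character; random-looking tokens score high."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, min_len: int = 20, threshold: float = 4.0) -> bool:
    """Hypothetical filter: flag long, high-entropy tokens for review."""
    return len(token) >= min_len and shannon_entropy(token) >= threshold

print(looks_like_secret("introduction_section_v2"))       # -> False
print(looks_like_secret("ghp_x9Qw7Lm2Tz8Kp4Rn6Vb1Yc3"))   # -> True
```

A filter like this prunes the regex stage's candidate set before the more expensive LLM-based review, which is one plausible reading of how the four stages divide the work.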