Large-scale online deanonymization with LLMs

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the critical vulnerability of online anonymity by demonstrating that users can be re-identified at scale using only their anonymous textual content. We propose the first end-to-end deanonymization framework based on large language models (LLMs), which automatically extracts identity-relevant semantic features directly from raw, unstructured text across arbitrary platforms. By integrating embedding-based matching with logical inference for verification, our approach achieves high-precision, scalable user re-identification in both open- and closed-world settings. Experiments on three real-world datasets show that our method substantially outperforms existing techniques, attaining 68% recall at 90% precision while the best non-LLM baseline fails almost entirely, thereby exposing severe weaknesses in current anonymity-preserving mechanisms.
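The headline metric, recall at a fixed precision target, can be made concrete with a threshold sweep over match scores. The sketch below is illustrative only: the scores are made up, and `recall_at_precision` is a hypothetical helper, not part of the paper's code.

```python
# Compute the best recall achievable at >= target precision by sweeping a
# score threshold over (score, is_true_match) predictions. Toy illustration;
# the paper reports 68% recall at 90% precision on its real datasets.
def recall_at_precision(preds: list[tuple[float, bool]],
                        target_precision: float = 0.9) -> float:
    total_pos = sum(1 for _, y in preds if y)
    best = 0.0
    for thresh, _ in preds:  # each score is a candidate threshold
        accepted = [(s, y) for s, y in preds if s >= thresh]
        tp = sum(1 for _, y in accepted if y)
        if accepted and tp / len(accepted) >= target_precision:
            best = max(best, tp / total_pos)
    return best
```

Raising the threshold trades recall for precision; the reported number is the recall at the loosest threshold that still keeps precision at or above 90%.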

📝 Abstract
We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take a dedicated human investigator hours. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles; our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
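The three-stage pipeline in the abstract (extract features, search by embedding, verify top candidates) can be sketched as follows. This is a toy illustration, not the authors' implementation: a hashing bag-of-words embedder stands in for a real text-embedding model, a token-overlap check stands in for LLM-based verification, and all function names are hypothetical.

```python
# Toy sketch of the three-stage attack pipeline described in the abstract:
# (1) feature extraction, (2) embedding-based candidate search, (3) verification.
import math

def extract_features(text: str) -> list[str]:
    # Stand-in for LLM feature extraction: keep distinctive tokens.
    return [t for t in text.lower().split() if len(t) > 3]

def embed(features: list[str], dim: int = 256) -> list[float]:
    # Hashing bag-of-words embedding, unit-normalized. A real pipeline
    # would use a neural embedding model here.
    vec = [0.0] * dim
    for f in features:
        vec[hash(f) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_candidates(query_vec: list[float],
                   db_vecs: list[list[float]], k: int = 3) -> list[int]:
    # Cosine similarity search (dot product of unit vectors), top-k indices.
    sims = [(sum(a * b for a, b in zip(query_vec, v)), i)
            for i, v in enumerate(db_vecs)]
    return [i for _, i in sorted(sims, reverse=True)[:k]]

def verify(query_feats: list[str], cand_feats: list[str],
           min_overlap: int = 2) -> bool:
    # Stand-in for LLM reasoning over a candidate pair: demand enough
    # shared identity-relevant features before declaring a match.
    return len(set(query_feats) & set(cand_feats)) >= min_overlap
```

The verification stage is what keeps precision high: embedding search casts a wide net over the candidate database, and only candidates that survive the stricter pairwise check are reported as matches.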

Problem

Research questions and friction points this paper is trying to address.

deanonymization
large language models
online privacy
pseudonymous identification
identity matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
deanonymization
semantic embeddings
unstructured text
privacy threat model