AI Summary
This work addresses the challenge that existing membership inference attacks (MIAs) on large language models struggle to distinguish model generalization from memorization, which limits their effectiveness. To overcome this limitation, the authors propose HT-MIA, a novel framework that focuses specifically on low-confidence (i.e., difficult) tokens. By comparing token-level output probability distributions between a fine-tuned target model and a pre-trained reference model, HT-MIA effectively disentangles memorization signals from generalization behavior. The approach significantly enhances both the accuracy and robustness of membership inference, consistently outperforming seven state-of-the-art baselines across medical and general benchmark datasets. Furthermore, the study empirically validates the critical role of difficult tokens in membership leakage and demonstrates the framework's utility in evaluating defense mechanisms such as differential privacy.
Abstract
With the widespread adoption of Large Language Models (LLMs) and increasingly stringent privacy regulations, protecting data privacy in LLMs has become essential, especially for privacy-sensitive applications. Membership Inference Attacks (MIAs) attempt to determine whether a specific data sample was included in a model's training or fine-tuning dataset, posing serious privacy risks. However, most existing MIA techniques against LLMs rely on sequence-level aggregated prediction statistics, which fail to distinguish prediction improvements caused by generalization from those caused by memorization, leading to low attack effectiveness. To address this limitation, we propose HT-MIA, a novel membership inference approach that captures token-level probabilities for low-confidence (hard) tokens, where membership signals are more pronounced. By comparing token-level probability improvements at hard tokens between a fine-tuned target model and a pre-trained reference model, HT-MIA isolates strong and robust membership signals that are obscured by prior MIA approaches. Extensive experiments on both domain-specific medical datasets and general-purpose benchmarks demonstrate that HT-MIA consistently outperforms seven state-of-the-art MIA baselines. We further investigate differentially private training as an effective defense mechanism against MIAs in LLMs. Overall, our HT-MIA framework establishes hard-token-based analysis as a state-of-the-art foundation for advancing membership inference attacks and defenses for LLMs.
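The core idea described above can be sketched in a few lines of code. The sketch below is illustrative only, not the paper's exact scoring formula: it assumes per-token probabilities have already been extracted from the target and reference models, selects "hard" tokens as those where the reference model is least confident (the `hard_quantile` cutoff is a hypothetical parameter), and scores a sample by the mean log-probability improvement of the target model on those tokens.

```python
import numpy as np

def ht_mia_score(target_probs, ref_probs, hard_quantile=0.2):
    """Illustrative hard-token membership score (a sketch, not the paper's formula).

    target_probs / ref_probs: probabilities that the fine-tuned target model
    and the pre-trained reference model assign to each token of the sample.
    Tokens where the reference model is least confident are treated as hard;
    the score is the mean log-probability improvement on those tokens.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    ref_probs = np.asarray(ref_probs, dtype=float)
    # Hard tokens: the lowest-confidence tokens under the reference model.
    cutoff = np.quantile(ref_probs, hard_quantile)
    hard = ref_probs <= cutoff
    # Log-probability improvement of target over reference at hard tokens.
    improvement = np.log(target_probs[hard]) - np.log(ref_probs[hard])
    return float(improvement.mean())

# A sample is flagged as a training member if its score exceeds a threshold.
# A memorized sample shows a large jump on its hard token (0.05 -> 0.7);
# a non-member barely improves on it (0.05 -> 0.06).
member_score = ht_mia_score([0.9, 0.8, 0.70, 0.95], [0.9, 0.8, 0.05, 0.9])
nonmember_score = ht_mia_score([0.9, 0.8, 0.06, 0.90], [0.9, 0.8, 0.05, 0.9])
print(member_score > nonmember_score)  # the member scores higher
```

Easy tokens contribute roughly equal probability under both models (generalization), so restricting the score to hard tokens is what filters generalization out of the membership signal.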