Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

📅 2026-06-10

🤖 AI Summary

This study addresses privacy compliance risks for Japanese large language models (LLMs), whose pre-training corpora may contain special care-required personal information (SCPI) as defined under Japan's Act on the Protection of Personal Information (APPI). The authors describe it as the first study to explore automatic SCPI detection in Japanese text. They use LLM-based annotation to construct an SCPI dataset and then train machine learning classifiers that can detect SCPI quickly. The resulting classifier is reported to identify SCPI-related information effectively, which supports the feasibility of the approach. The study also highlights how difficult accurate detection remains.

📝 Abstract

Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and to prevent unintended information leakage. However, compared with English and other languages, research on sensitive personal information in Japanese has been limited. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. The resulting SCPI classifier effectively identifies SCPI-related information. This study is the first to explore SCPI detection in Japanese text corpora, and it highlights the challenges of accurate detection.
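
The abstract does not say which model family the authors train, so the following is only a minimal sketch of the general idea: fit a lightweight classifier on annotated sentences, then use it to flag likely SCPI. It assumes a character-bigram Naive Bayes model, which avoids the need for a Japanese tokenizer. The four training sentences, their labels (standing in for LLM-produced annotations), and the class name `NaiveBayesSCPI` are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping character bigrams (no tokenizer needed)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayesSCPI:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}  # bigram counts per class
        self.docs = Counter(labels)                 # document counts per class
        for text, label in zip(texts, labels):
            self.counts[label].update(char_bigrams(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_score(self, text, label):
        total = sum(self.counts[label].values())
        # Log prior from class frequencies in the training set.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in char_bigrams(text):
            # Add-one smoothing keeps unseen bigrams from zeroing the score.
            score += math.log(
                (self.counts[label][gram] + 1) / (total + len(self.vocab))
            )
        return score

    def predict(self, text):
        """Return 1 if the text looks like SCPI, else 0."""
        return int(self.log_score(text, 1) > self.log_score(text, 0))


# Hypothetical training data: label 1 = mentions SCPI, label 0 = does not.
# In the paper's setting these labels would come from LLM-based annotation.
train_texts = [
    "彼は前科がある",          # "He has a criminal record"
    "私は病歴を医師に伝えた",  # "I told the doctor my medical history"
    "今日は天気が良い",        # "The weather is nice today"
    "新しい本を買った",        # "I bought a new book"
]
train_labels = [1, 1, 0, 0]

clf = NaiveBayesSCPI().fit(train_texts, train_labels)

# Flag sentences in a small hypothetical corpus before pre-training.
corpus = ["彼女には前科がある", "今日は本を買った"]
flagged = [doc for doc in corpus if clf.predict(doc) == 1]
```

A model this small mainly memorizes surface patterns, which is one reason accurate detection is hard: a sentence can mention a criminal record or a medical history without describing an identifiable person, and a realistic classifier would need far more data and context than this toy example uses.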

Problem

Research questions and friction points this paper is trying to address.

sensitive personal information

Japanese pre-training corpora

large language models

privacy regulations

special care-required personal information

Innovation

Methods, ideas, or system contributions that make the work stand out.

sensitive personal information

Japanese LLM corpora

SCPI detection