Vision Language Model Helps Private Information De-Identification in Vision Data

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of effective automated mechanisms for sanitizing sensitive text, such as protected health information in medical images, from visual data. To this end, the authors propose VisShield, a framework that combines instruction tuning with optical character recognition (OCR) to enable end-to-end detection and redaction of privacy-sensitive textual content. Central to this approach is OPTIC, a privacy-oriented instruction-tuning dataset, along with a tailored training strategy that equips vision-language models to localize and mask sensitive text. By predicting bounding boxes for sensitive entities and masking the corresponding regions, VisShield is reported to significantly outperform existing methods in both localization accuracy and redaction effectiveness, offering a pathway for deploying vision-language models in privacy-critical applications.

📝 Abstract

Visual Language Models (VLMs) have gained significant popularity due to their remarkable capabilities. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To this end, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset, OPTIC (Optical Privacy Text Instruction Collection), and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code are publicly available.
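
The masking step described in the abstract, where predicted bounding boxes are used to cover sensitive text, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Box` type, the `redact` function, the exclusive-corner box convention, and the toy grayscale image are all assumptions made for illustration, and a real pipeline would likely operate on image arrays through an imaging library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box in pixel coordinates; x1 and y1 are exclusive."""
    x0: int
    y0: int
    x1: int
    y1: int
    label: str = "PHI"


# Assumed image representation: rows of grayscale intensities in 0..255.
Image = List[List[int]]


def redact(image: Image, boxes: List[Box], fill: int = 0) -> Image:
    """Return a copy of `image` with every box region overwritten by `fill`."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]  # copy rows so the input stays untouched
    for box in boxes:
        # Clamp to the image bounds so a slightly-off prediction cannot raise.
        x0, x1 = max(0, box.x0), min(width, box.x1)
        y0, y1 = max(0, box.y0), min(height, box.y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = fill
    return out


# Toy 4x6 white image and one made-up predicted box standing in for model output.
image = [[255] * 6 for _ in range(4)]
predicted = [Box(x0=1, y0=1, x1=4, y1=3, label="patient_name")]
masked = redact(image, predicted)
```

Overwriting pixels, rather than drawing an overlay, means the original text cannot be recovered from the output image. Whether VisShield fills, blurs, or inpaints the region is not stated in this summary.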

Problem

Research questions and friction points this paper is trying to address.

privacy

visual language models

de-identification

sensitive text

Protected Health Information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Models

Privacy-Preserving

De-identification