🤖 AI Summary
This work presents the first systematic evaluation of large vision-language models (VLMs) for text tampering detection—a previously unexplored research direction. We benchmark both proprietary (e.g., GPT-4o) and leading open-source VLMs on synthetic datasets and real-world scenarios, including in-the-wild textual content and forged identity documents. Results show that while current VLMs exhibit non-negligible capability in identifying text tampering, their generalization remains limited—particularly for fine-grained, character-level alterations and low-fidelity forgeries. Open-source VLMs achieve performance comparable to, yet consistently lag behind, GPT-4o. In contrast, specialized image forensics models demonstrate severe generalization failure on text-specific tampering tasks. Our study establishes the first empirical benchmark for VLMs in text authenticity verification, revealing both their promise and critical bottlenecks in trustworthy vision-language understanding.
📝 Abstract
Recent works have shown the effectiveness of Large Vision-Language Models (VLMs, or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing from these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on several text manipulation datasets. Our results suggest that open-source models are closing the gap but still trail closed-source ones such as GPT-4o. Additionally, we benchmark VLMs specialized for image manipulation detection on text manipulation tasks and show that they suffer from a generalization problem. Finally, we evaluate VLMs on manipulations of in-the-wild scene text and on fantasy ID cards, where the latter mimics a challenging real-world misuse scenario.