🤖 AI Summary
This work presents the first systematic evaluation of large vision-language models (VLMs) for text tampering detection—a previously unexplored research direction. We benchmark both proprietary (e.g., GPT-4o) and leading open-source VLMs on synthetic datasets and real-world scenarios, including in-the-wild textual content and forged identity documents. Results show that while current VLMs exhibit non-negligible capability in identifying text tampering, their generalization remains limited—particularly for fine-grained, character-level alterations and low-fidelity forgeries. Open-source VLMs achieve performance comparable to, yet consistently lag behind, GPT-4o. In contrast, specialized image forensics models demonstrate severe generalization failure on text-specific tampering tasks. Our study establishes the first empirical benchmark for VLMs in text authenticity verification, revealing both their promise and critical bottlenecks in trustworthy vision-language understanding.
📝 Abstract
Recent works have shown the effectiveness of Large Vision-Language Models (VLMs, or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing from these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on several text manipulation datasets. Our results suggest that open-source models are closing the gap but still trail closed-source ones such as GPT-4o. Additionally, we benchmark VLMs specialized for image manipulation detection on text manipulation tasks and show that they suffer from a generalization problem. Finally, we evaluate VLMs on manipulations of in-the-wild scene text and on fantasy ID cards, where the latter mimics a challenging real-world misuse scenario.