The Regularizing Power of Language-Training Deepfake Detectors

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalizability and interpretability of current deepfake detection models, which often overfit to dataset-specific low-level artifacts. The authors propose a reinforcement learning–based “explain-then-classify” training paradigm that leverages only binary labels to guide the model toward reasoning with high-level, linguistically describable semantic features. Their approach employs a dual-encoder architecture combining a frozen specialized detector with a LoRA-finetuned multimodal large language model, trained through a two-stage process involving binary alignment followed by reinforcement learning. Experiments demonstrate that the method significantly outperforms state-of-the-art models across multiple benchmarks, achieving superior cross-dataset generalization and producing interpretable outputs—even without relying on explicit chains of thought during inference.
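As a rough illustration of what "LoRA-finetuned" means for the MLLM branch, the sketch below shows a single linear layer whose pretrained weight stays frozen while a low-rank update is the only trainable part. It uses plain Python lists instead of a deep learning framework, and the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the toy weight matrix are all illustrative assumptions, not details taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical toy layer: names and hyperparameters are assumptions
    chosen for illustration only.
    """

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, never updated
        # A starts as small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank trainable path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        # Only A and B would receive gradients during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


frozen_weight = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
layer = LoRALinear(frozen_weight)
initial_output = layer.forward([1, 2, 3])
```

Because `B` starts at zero, the layer reproduces the frozen weight's output before any training, and fine-tuning only has to learn the small matrices `A` and `B`.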

📝 Abstract

Recently, thanks to the advent of multimodal LLMs (MLLMs), deepfake detectors are striving to be not only generalizable but also interpretable. We propose that these two challenges can be tackled jointly: describable artifacts typically generalize better, which opens the possibility of using language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level, domain-specific artifacts, our intuition is that an LLM pretrained on language will prefer high-level artifacts that are easier to describe. This way, the model uses high-level features where possible and is trained to use low-level features where necessary. We use a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum. First, a binary alignment phase shows that MLLMs can, by virtue of their intrinsic capabilities, effectively combine features to mitigate overfitting to dataset-specific artifacts. Second, to further bolster generalization and achieve interpretability, a reinforcement learning stage encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, which outperforms state-of-the-art methods by a large margin.
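The abstract does not spell out the reward used in the reinforcement learning stage, so the following is only a minimal sketch of how an "explain-then-classify" reward could be computed from binary labels alone. The `<answer>` tag format, the minimum explanation length, and the reward weights are hypothetical choices made for illustration, not the authors' design.

```python
import re

# Assumed output format: free-text explanation followed by a final
# <answer>real</answer> or <answer>fake</answer> tag (hypothetical).
VERDICT_PATTERN = re.compile(r"<answer>\s*(real|fake)\s*</answer>\s*$", re.IGNORECASE)


def explain_then_classify_reward(response, label,
                                 min_explanation_words=5, format_bonus=0.5):
    """Score one generated response using only the binary label.

    label is "real" or "fake". The weights are illustrative assumptions.
    """
    text = response.strip()
    match = VERDICT_PATTERN.search(text)
    if match is None:
        return 0.0  # no parsable verdict: nothing to reward

    reward = 0.0
    explanation = text[:match.start()]
    # Reward the behavior of describing evidence before the verdict.
    if len(explanation.split()) >= min_explanation_words:
        reward += format_bonus
    # Reward a correct verdict; this is the only place the label is used.
    if match.group(1).lower() == label:
        reward += 1.0
    return reward


explained = ("The jawline blends unnaturally into the neck and the teeth "
             "are smeared. <answer>fake</answer>")
reward_explained_correct = explain_then_classify_reward(explained, "fake")
reward_bare_correct = explain_then_classify_reward("<answer>fake</answer>", "fake")
reward_explained_wrong = explain_then_classify_reward(explained, "real")
reward_unparsable = explain_then_classify_reward("I think it is fake", "fake")
```

Under these assumed weights, a correct verdict preceded by an explanation scores higher than a correct verdict alone, which is one simple way to favor the explain-then-classify behavior without any explanation-level annotations.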

Problem

Research questions and friction points this paper is trying to address.

deepfake detection

overfitting

generalization

interpretability

domain-specific artifacts

Innovation

Methods, ideas, or system contributions that make the work stand out.

language regularization

multimodal LLM

deepfake detection