Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limited use of 3D geometric modeling in AI-generated video detection, this paper proposes a geometry-aware detection framework based on vanishing points. It is the first to explicitly introduce vanishing points as 3D geometric cues for video detection, capturing the fundamental disparity in 3D geometric temporal consistency between real and synthetic videos. The method is a geometry-aware Transformer with three components: (i) a geometric positional encoding that embeds the spatial constraints imposed by vanishing points; (ii) a joint temporal-geometric attention mechanism that models spatiotemporal geometric coherence; and (iii) an EMA-optimized geometric classification head that improves training stability. The approach achieves significant improvements over state-of-the-art methods on static-scene AI video benchmarks and shows strong cross-domain generalization, detecting videos from unseen generators (e.g., SVD, Pika) without fine-tuning.
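To make the vanishing-point cue concrete, here is a minimal sketch of how a vanishing point could be estimated from line segments and how its frame-to-frame drift could be measured. It is not the paper's pipeline: the function names (`line_through`, `vanishing_point`, `vp_drift`), the least-squares estimator, and the assumption of a fixed camera with already-detected segments are all illustrative choices.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points p=(x, y) and q=(x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares intersection of line segments [(p, q), ...].

    Hypothetical helper: assumes the segments belong to one family of
    parallel 3D lines and that the vanishing point is finite.
    """
    lines = np.array([line_through(p, q) for p, q in segments], dtype=float)
    # Scale each line so its normal has unit length (equal weight per segment).
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # The point v minimizing ||lines @ v|| is the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(lines)
    v = vt[-1]
    return v[:2] / v[2]

def vp_drift(per_frame_segments):
    """Pixel distance between vanishing points of consecutive frames."""
    vps = np.array([vanishing_point(s) for s in per_frame_segments])
    return np.linalg.norm(np.diff(vps, axis=0), axis=1)
```

Under these assumptions, a real static scene filmed from a fixed camera should give drift values near zero, while geometrically inconsistent frames would show larger, irregular drift. With a moving camera the vanishing points move as well, so a practical detector would need to model smooth motion rather than threshold raw drift.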

📝 Abstract

Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.
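The abstract does not say what the exponential moving average (EMA) in the classifier head is applied to. One common reading is an EMA of the head's weights, which smooths noisy training updates. The sketch below shows only that generic update rule; `ema_update`, the `decay` value, and the toy weight vectors are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def ema_update(shadow, weights, decay=0.99):
    """Blend the current weights into a slowly moving average.

    shadow  : running average kept alongside the trained weights
    weights : weights after the latest optimizer step
    decay   : closer to 1.0 means a smoother, slower-moving average
    """
    return decay * shadow + (1.0 - decay) * weights

def track_ema(weight_history, decay=0.99):
    """Apply ema_update over a sequence of weight snapshots."""
    shadow = np.array(weight_history[0], dtype=float)
    for weights in weight_history[1:]:
        shadow = ema_update(shadow, np.asarray(weights, dtype=float), decay)
    return shadow
```

In this reading, the averaged weights would be used at evaluation time, which tends to damp the effect of any single noisy update.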

Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated videos using 3D geometric temporal consistency

Addressing limited exploration of 3D geometric patterns in existing methods

Revealing geometric discrepancies between real and AI-generated videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using vanishing points to represent 3D geometry patterns

Introducing a geometry-aware transformer with temporal-geometric attention

Constructing a static scene dataset for stable feature extraction

🔎 Similar Papers

Detecting AI-Generated Video via Frame Consistency