🤖 AI Summary
To protect high-profile live speech videos from visual falsification of the speaker's identity and lip and facial motion, this paper proposes Spotlight: a system that embeds cryptographically secured physical signatures into video at capture time via high-speed, imperceptible light modulation, so that any downstream recording can be checked against the portrayed speaker identity and lip and facial motion. Its core contributions are: (1) the first physical-layer optical modulation signature mechanism tailored for video, imperceptible both in recorded video and to live viewers; (2) a framework for generating compact (150-bit), pose-invariant speech video features that are semantically meaningful and cryptographically secured; and (3) an integrated verification algorithm combining locality-sensitive hashing, lightweight cryptographic binding, and robust feature extraction. In prototype experiments on extensive video datasets, Spotlight achieves AUC ≥ 0.99 and an overall true positive rate of 100% in detecting falsified videos, and remains robust across recording conditions, video post-processing, and white-box adversarial attacks on its feature extraction.
📝 Abstract
High-profile speech videos are prime targets for falsification, owing to their accessibility and influence. This work proposes Spotlight, a low-overhead and unobtrusive system for protecting live speech videos from visual falsification of speaker identity and lip and facial motion. Unlike predominant falsification detection methods operating in the digital domain, Spotlight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible modulated light. These physical signatures encode semantically meaningful features unique to the speech event, including the speaker's identity and facial motion, and are cryptographically secured to prevent spoofing. The signatures can be extracted from any video downstream and validated against the portrayed speech content to check its integrity. Key elements of Spotlight include (1) a framework for generating extremely compact (i.e., 150-bit), pose-invariant speech video features, based on locality-sensitive hashing; and (2) an optical modulation scheme that embeds >200 bps into video while remaining imperceptible both in video and live. Prototype experiments on extensive video datasets show Spotlight achieves AUCs $\geq$ 0.99 and an overall true positive rate of 100% in detecting falsified videos. Further, Spotlight is highly robust across recording conditions, video post-processing techniques, and white-box adversarial attacks on its video feature extraction methodologies.
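To make the idea of a compact, hash-based signature concrete, the sketch below uses random-hyperplane locality-sensitive hashing, a generic LSH technique, to turn a feature vector into 150 bits and compares signatures by Hamming distance. This is not Spotlight's actual feature pipeline: the feature dimension, the random vectors standing in for facial-motion features, and all function names are assumptions chosen for illustration. Only the 150-bit length and the 200 bps figure come from the abstract.

```python
import random

SIGNATURE_BITS = 150   # signature length stated in the abstract
FEATURE_DIM = 64       # hypothetical feature-vector size (assumption)
EMBED_RATE_BPS = 200   # lower bound on the embedding rate from the abstract


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(features, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(w * x for w, x in zip(plane, features)) >= 0.0 else 0
        for plane in hyperplanes
    ]


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(SIGNATURE_BITS, FEATURE_DIM)

# Stand-in feature vectors (random data, not real video features).
rng = random.Random(1)
original = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
similar = [x + rng.gauss(0.0, 0.05) for x in original]          # small perturbation
different = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]   # unrelated content

sig_original = lsh_signature(original, planes)
d_similar = hamming_distance(sig_original, lsh_signature(similar, planes))
d_different = hamming_distance(sig_original, lsh_signature(different, planes))
print("Hamming distance, similar vs. different:", d_similar, d_different)

# At the 200 bps lower bound, sending 150 raw bits takes at most 0.75 s,
# ignoring any framing or error-correction overhead (an assumption).
seconds_per_signature = SIGNATURE_BITS / EMBED_RATE_BPS
```

Nearby feature vectors flip only a few bits while unrelated vectors differ in roughly half of them, which is the property that lets a verifier tolerate benign variation yet flag content that does not match the embedded signature. How Spotlight sets its decision threshold and binds the hash cryptographically is not specified in the abstract and is not modeled here.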