🤖 AI Summary
Edge-based speaker authentication systems face dual threats: deepfake audio attacks targeting the data plane and model-poisoning attacks compromising the control plane of federated learning.
Method: We propose a physics-guided, uncertainty-aware joint defense framework that integrates vocal tract dynamics modeling with self-supervised multimodal representation learning, implemented as a Bayesian deep learning detection architecture that simultaneously mitigates data-plane deepfakes and control-plane model poisoning.
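As a rough illustration of the detection side of this method, the sketch below fuses physics-derived features with self-supervised (SSL) speech embeddings and approximates the Bayesian ensemble with Monte Carlo dropout; all class names, feature dimensions, and the MC-dropout choice are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a multimodal Bayesian detection head (assumed design,
# not the paper's code). Physics features (e.g., vocal-tract-dynamics
# descriptors) and SSL embeddings are encoded, fused, and classified;
# Monte Carlo dropout approximates Bayesian predictive uncertainty.
import torch
import torch.nn as nn

class MultimodalBayesianDetector(nn.Module):
    def __init__(self, physics_dim=64, ssl_dim=768, hidden=256, p_drop=0.3):
        super().__init__()
        self.physics_enc = nn.Sequential(nn.Linear(physics_dim, hidden), nn.ReLU())
        self.ssl_enc = nn.Sequential(nn.Linear(ssl_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),    # kept stochastic at test time for MC sampling
            nn.Linear(hidden, 2),  # bona fide vs. deepfake
        )

    def forward(self, physics_feats, ssl_feats):
        fused = torch.cat([self.physics_enc(physics_feats),
                           self.ssl_enc(ssl_feats)], dim=-1)
        return self.head(fused)

    @torch.no_grad()
    def predict_with_uncertainty(self, physics_feats, ssl_feats, n_samples=20):
        """Mean softmax over stochastic passes = prediction; variance = uncertainty."""
        self.train()  # keep dropout active so each forward pass differs
        probs = torch.stack([
            torch.softmax(self(physics_feats, ssl_feats), dim=-1)
            for _ in range(n_samples)
        ])
        return probs.mean(dim=0), probs.var(dim=0)
```

High predictive variance can then flag inputs, or (on the control plane) client behavior, that the system should not trust.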
Contribution/Results: This work pioneers the tight integration of interpretable acoustic physical priors with Bayesian uncertainty estimation, significantly improving robustness to novel adversarial attacks while retaining explainability. Experiments demonstrate high detection accuracy (>98.2%) and strong poisoning resistance, reducing the poisoning success rate to <3.1% under complex adversarial conditions. To our knowledge, this is the first multimodal solution for edge speaker authentication that jointly ensures physical interpretability, statistical reliability, and distributed security.
📝 Abstract
Voice authentication systems deployed at the network edge face dual threats: a) sophisticated deepfake synthesis attacks and b) control-plane poisoning of distributed federated learning protocols. We present a framework coupling physics-guided deepfake detection with uncertainty-aware federated edge learning. The framework fuses interpretable physics features modeling vocal tract dynamics with representations from a self-supervised learning module. These representations are processed by a Multi-Modal Ensemble Architecture, followed by a Bayesian ensemble that provides uncertainty estimates. By combining physics-based evaluation of audio characteristics with per-sample uncertainty estimates, the proposed framework remains robust to both advanced deepfake attacks and sophisticated control-plane poisoning, addressing the complete threat model for networked voice authentication.
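The abstract does not spell out the server-side aggregation rule, so the following is only a plausible minimal sketch of an uncertainty-aware control-plane defense: client updates that are both far from the coordinate-wise median update and accompanied by unusually high predictive uncertainty are excluded before averaging. The function name, MAD-based cutoff, and flagging rule are assumptions for illustration.

```python
# Hypothetical uncertainty-aware robust aggregation for federated rounds.
# `updates` are flattened per-client model deltas; `uncertainties` are each
# client's mean predictive variance from the Bayesian detector.
import numpy as np

def robust_aggregate(updates, uncertainties, k=1.5):
    U = np.stack(updates)              # (n_clients, n_params)
    center = np.median(U, axis=0)      # coordinate-wise median update
    dists = np.linalg.norm(U - center, axis=1)
    mad = np.median(np.abs(dists - np.median(dists)))
    cutoff = np.median(dists) + k * mad
    # A client is suspect if its update is an outlier AND its reported
    # uncertainty is above the cohort median (possible poisoning signal).
    suspect = (dists > cutoff) & (np.asarray(uncertainties) > np.median(uncertainties))
    weights = (~suspect).astype(float)
    if weights.sum() == 0:             # all flagged: fall back to the median
        return center
    return (U * weights[:, None]).sum(axis=0) / weights.sum()
```

Requiring both signals before excluding a client reduces the chance of discarding benign-but-noisy edge devices.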