Beyond the Voice: Inertial Sensing of Mouth Motion for High Security Speech Verification

📅 2025-10-16

🤖 AI Summary

To address the vulnerability of voice authentication to high-fidelity synthetic speech attacks in high-stakes scenarios, this paper proposes a dual-factor authentication method that combines acoustic evidence with oro-facial motion. It introduces lightweight perioral inertial sensors to capture individual-specific mouth and lower-face dynamics, which are fused with temporal acoustic features for joint modeling. Unlike vision-based approaches, the method is camera-free, exhibits strong discriminability across individuals, and maintains robust performance under challenging conditions, including walking on level ground, stair climbing, and speech by both native and non-native English speakers. Evaluated on a real-world dataset comprising 43 subjects across these scenarios, the system achieves a median equal error rate (EER) of 0.01 or lower, significantly outperforming unimodal baselines. The key contribution lies in pioneering the use of oro-facial inertial motion as a second biometric modality, thereby establishing a robust, low-power, vision-free framework for enhanced voice authentication.

📝 Abstract

Voice interfaces are increasingly used in high-stakes domains such as mobile banking, smart home security, and hands-free healthcare. Meanwhile, modern generative models have made high-quality voice forgeries inexpensive and easy to create, eroding confidence in voice authentication alone. To strengthen protection against such attacks, we present a second authentication factor that combines acoustic evidence with the unique motion patterns of a speaker's lower face. By placing lightweight inertial sensors around the mouth to capture mouth opening and evolving lower facial geometry, our system records a distinct motion signature with strong discriminative power across individuals. We built a prototype and recruited 43 participants to evaluate the system under four conditions: seated, walking on level ground, walking on stairs, and speaking with different language backgrounds (native vs. non-native English). Across all scenarios, our approach consistently achieved a median equal error rate (EER) of 0.01 or lower, indicating that mouth movement data remain robust under variations in gait, posture, and spoken language. We discuss specific use cases where this second line of defense could provide tangible security benefits to voice authentication systems.
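For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the "higher score means more likely genuine" convention, and the sample scores are all assumptions made for illustration. The EER is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine users rejected).

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Assumption of this sketch: a higher score means "more likely genuine",
    and an attempt is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine attempts scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: impostor attempts scored at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative, made-up scores (not data from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.5]
print(equal_error_rate(genuine, impostor))
```

An EER of 0.01 would mean that, at the balanced threshold, roughly 1 in 100 impostor attempts is accepted and 1 in 100 genuine attempts is rejected. How the paper aggregates per-condition or per-participant values into its median is not stated in this summary.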

Problem

Research questions and friction points this paper is trying to address.

Combining acoustic data with mouth motion patterns for enhanced voice authentication security

Using inertial sensors to capture unique lower facial movement signatures during speech

Addressing vulnerability of voice systems to forgery attacks through multimodal biometric verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines acoustic data with facial motion patterns

Uses inertial sensors to capture mouth movement signatures

Achieves robust authentication across various physical activities
