Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

📅 2026-05-21

🤖 AI Summary
This study examines how increasingly realistic audio deepfakes affect people's ability to recognize genuine speech and their trust in authentic audio. In a large-scale listening experiment with 1,768 participants and 35,532 judgments, the authors evaluated audio from 138 text-to-speech and voice conversion systems, including commercial platforms, autoregressive language models, sequence-to-sequence architectures, and flow-matching approaches. The central finding is a “skepticism shift”: compared to a 2021 baseline, accuracy on fake samples barely changed (72.9% to 71.2%), while accuracy on real samples dropped from 72.7% to 64.1%. Listeners have not become worse at detecting synthesis artifacts; they have become more likely to distrust authentic speech. An ML detector used as a reference point maintained over 94.5% accuracy across all conditions. The results suggest that the primary threat of modern deepfakes may lie less in deception than in the erosion of trust in genuine audio.
📝 Abstract
Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.
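The size of the shift is easier to see as percentage-point differences. The sketch below is only illustrative arithmetic on the four accuracy figures quoted in the abstract; the names `BASELINE_2021`, `CURRENT_STUDY`, and `accuracy_change` are hypothetical and do not come from the paper.

```python
# Accuracy figures (percent) as quoted in the abstract.
# The dictionary and function names are illustrative, not the authors'.
BASELINE_2021 = {"fake": 72.9, "real": 72.7}
CURRENT_STUDY = {"fake": 71.2, "real": 64.1}


def accuracy_change(baseline, current):
    """Return the change in accuracy, in percentage points, per sample type."""
    # Round to one decimal to match the precision of the reported figures.
    return {kind: round(current[kind] - baseline[kind], 1) for kind in baseline}


change = accuracy_change(BASELINE_2021, CURRENT_STUDY)
print(change)  # {'fake': -1.7, 'real': -8.6}
```

On these figures, accuracy on real samples fell by 8.6 points against 1.7 points for fake samples, which is the asymmetry the abstract describes.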
Problem

Research questions and friction points this paper is trying to address.

audio deepfakes
trust erosion
human perception
real speech
fake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio deepfake
human perception
trust erosion
voice synthesis detection
large-scale listening study