Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel approach to audio deepfake detection that addresses a limitation of existing methods: they overlook fine-grained inconsistencies between emotional and acoustic features. By leveraging cross-level (frame- and utterance-level) emotional–acoustic incongruence as a core discriminative cue, the method projects both feature types into a comparable space through representation projection and integrates multi-granularity information to model their inconsistency. This strategy overcomes the constraints of conventional approaches that either rely on enhancing feature correlation or model the two modalities in isolation. Evaluated on the ASVspoof 2019 LA and 2021 LA datasets, the proposed method significantly outperforms current baselines, demonstrating its effectiveness in advancing anti-spoofing performance for audio forensics.

📝 Abstract
Audio Deepfake Detection (ADD) aims to distinguish spoofed speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between them and smooth out abrupt discontinuities. To address these issues, we propose EAI-ADD, which treats cross-level emotion–acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space, then progressively integrate frame-level and utterance-level emotion features with acoustic features to capture cross-level emotion–acoustic inconsistencies across different temporal granularities. Experimental results on the ASVspoof 2019 LA and 2021 LA datasets demonstrate that EAI-ADD outperforms baselines, providing a more effective solution for audio anti-spoofing.
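The pipeline described in the abstract (project both feature streams into a comparable space, then measure inconsistency at frame and utterance granularity) can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the dimensions, the random linear projections, and the cosine-distance fusion are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T frames, emotion dim De, acoustic dim Da, shared dim D.
T, De, Da, D = 50, 64, 128, 32

emo_frames = rng.standard_normal((T, De))  # frame-level emotion features
ac_frames = rng.standard_normal((T, Da))   # frame-level acoustic features

# Projections into a comparable space (random stand-ins for learned weights).
W_e = rng.standard_normal((De, D)) / np.sqrt(De)
W_a = rng.standard_normal((Da, D)) / np.sqrt(Da)

def project(x, W):
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit-normalize per frame

z_e = project(emo_frames, W_e)  # (T, D)
z_a = project(ac_frames, W_a)   # (T, D)

# Frame-level inconsistency: per-frame cosine distance between the two streams.
frame_incons = 1.0 - np.sum(z_e * z_a, axis=-1)  # (T,)

# Utterance-level inconsistency: cosine distance between mean-pooled representations.
u_e = z_e.mean(axis=0); u_e /= np.linalg.norm(u_e)
u_a = z_a.mean(axis=0); u_a /= np.linalg.norm(u_a)
utt_incons = 1.0 - float(u_e @ u_a)

# Naive fused score; a real system would learn this fusion and a decision threshold.
score = 0.5 * frame_incons.mean() + 0.5 * utt_incons
print(score)
```

In this toy version a higher score means the emotional and acoustic streams disagree more, which the paper treats as evidence of spoofing; the actual EAI-ADD model integrates the two levels progressively rather than averaging fixed cosine distances.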
Problem

Research questions and friction points this paper is trying to address.

Audio Deepfake Detection
Emotion-Acoustic Inconsistency
Cross-Level Analysis
Spoof Speech
Temporal Desynchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-level inconsistency
emotion-acoustic desynchronization
audio deepfake detection
multi-granularity fusion
EAI-ADD
Jinhua Zhang, University of Electronic Science and Technology of China
Zhenqi Jia, Inner Mongolia University
Rui Liu, Inner Mongolia University