🤖 AI Summary
To address the growing risk of emotional privacy leakage due to widespread deployment of speech technologies, this paper proposes a lightweight, on-device audio defense method. The approach uniquely repurposes familiar, user-accessible audio editing operations—such as pitch shifting and time stretching—as controllable emotional privacy protection mechanisms, requiring no auxiliary models or cloud dependency. By jointly perturbing pitch and tempo, it significantly degrades DNN- and LLM-based emotion recognition performance while preserving speech naturalness and editing usability. Rigorous adversarial evaluation and reversibility analysis confirm its robustness. Empirical results across three public benchmark datasets demonstrate an average reduction of over 40% in emotion recognition accuracy. The solution is implemented as a plug-and-play module compatible with mainstream Android and iOS audio applications, ensuring strong security guarantees, practical usability, and cross-platform interoperability.
📝 Abstract
The rapid proliferation of speech-enabled technologies, including virtual assistants, video conferencing platforms, and wearable devices, has raised significant privacy concerns, particularly regarding the inference of sensitive emotional information from audio data. Existing privacy-preserving methods often compromise usability and security, limiting their adoption in practical scenarios. This paper introduces a novel, user-centric approach that leverages familiar audio editing techniques, specifically pitch and tempo manipulation, to protect emotional privacy without sacrificing usability. By analyzing popular audio editing applications on Android and iOS platforms, we identified these features as both widely available and easy to use. We rigorously evaluated their effectiveness against a threat model that considers adversarial attacks from diverse sources, including Deep Neural Networks (DNNs) and Large Language Models (LLMs), as well as reversibility testing. Our experiments, conducted on three distinct datasets, demonstrate that pitch and tempo manipulation effectively obfuscates emotional cues. Additionally, we explore design principles for a lightweight, on-device implementation to ensure broad applicability across various devices and platforms.
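The core operations the paper relies on, pitch shifting and time stretching, can be sketched in plain Python. This is an illustrative implementation under assumptions of our own (naive overlap-add stretching, linear-interpolation resampling, and the helper names `time_stretch` and `pitch_shift`), not the authors' actual pipeline, which is described only at the level of editing features:

```python
import math

def time_stretch(samples, rate, frame=1024, hop=256):
    """Naive overlap-add time stretch: rate > 1 speeds up, rate < 1 slows down.
    Output length is roughly len(samples) / rate; pitch is approximately kept."""
    out_len = int(len(samples) / rate)
    out = [0.0] * (out_len + frame)
    norm = [0.0] * (out_len + frame)
    # Hann window for smooth cross-fading between overlapping frames
    win = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    pos, out_pos = 0.0, 0
    while int(pos) + frame <= len(samples) and out_pos + frame <= len(out):
        start = int(pos)
        for i in range(frame):
            out[out_pos + i] += samples[start + i] * win[i]
            norm[out_pos + i] += win[i]
        pos += hop * rate   # read frames faster/slower than we write them
        out_pos += hop
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)][:out_len]

def pitch_shift(samples, semitones, frame=1024, hop=256):
    """Shift pitch by resampling (which also changes speed),
    then time-stretch back to roughly the original duration."""
    factor = 2 ** (semitones / 12)
    # Linear-interpolation resample: reading with step `factor` raises
    # the pitch by `semitones` when played back at the original rate.
    resampled, pos = [], 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        resampled.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += factor
    # Undo the duration change introduced by resampling
    return time_stretch(resampled, 1 / factor, frame, hop)
```

A joint perturbation in the spirit of the paper would apply both, e.g. `time_stretch(pitch_shift(audio, 2.0), 1.1)`; stronger settings degrade emotion classifiers more but cost naturalness, which is the trade-off the evaluation measures.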