The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) pose novel audio privacy risks by inferring sensitive personal attributes—such as age, gender, emotion, and socioeconomic status—from covertly captured audio. Addressing two key challenges—(i) the absence of benchmark datasets with fine-grained sensitive attribute annotations, and (ii) MLLMs’ limited direct audio reasoning capability—this work makes three contributions: (1) it formally defines the concept of *audio private attribute profiling*; (2) it introduces AP², the first large-scale, multi-dimensional audio benchmark dataset annotated with diverse sensitive attributes; and (3) it proposes Gifts, a multi-agent framework integrating audio-language models (ALMs) and large language models (LLMs), leveraging prompt engineering, inference guidance, and consistency analysis to mitigate hallucination and significantly improve both accuracy and robustness in sensitive attribute inference. Experiments demonstrate that Gifts substantially outperforms existing baselines; additionally, preliminary defense strategies are explored at both data and model levels.

📝 Abstract
Our research uncovers a novel privacy risk associated with multimodal large language models (MLLMs): the ability to infer sensitive personal attributes from audio data -- a technique we term audio private attribute profiling. This capability poses a significant threat, as audio can be covertly captured without direct interaction or visibility. Moreover, compared to images and text, audio carries unique characteristics, such as tone and pitch, which can be exploited for more detailed profiling. However, two key challenges exist in understanding MLLM-employed private attribute profiling from audio: (1) the lack of audio benchmark datasets with sensitive attribute annotations and (2) the limited ability of current MLLMs to infer such attributes directly from audio. To address these challenges, we introduce AP^2, an audio benchmark dataset that consists of two subsets collected and composed from real-world data, and both are annotated with sensitive attribute labels. Additionally, we propose Gifts, a hybrid multi-agent framework that leverages the complementary strengths of audio-language models (ALMs) and large language models (LLMs) to enhance inference capabilities. Gifts employs an LLM to guide the ALM in inferring sensitive attributes, then forensically analyzes and consolidates the ALM's inferences, overcoming severe hallucinations of existing ALMs in generating long-context responses. Our evaluations demonstrate that Gifts significantly outperforms baseline approaches in inferring sensitive attributes. Finally, we investigate model-level and data-level defense strategies to mitigate the risks of audio private attribute profiling. Our work validates the feasibility of audio-based privacy attacks using MLLMs, highlighting the need for robust defenses, and provides a dataset and framework to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

MLLMs can infer sensitive attributes from audio data
Lack of annotated audio datasets for privacy profiling
Current MLLMs struggle with direct audio attribute inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AP², the first audio benchmark dataset annotated with sensitive attributes
Proposes Gifts, a hybrid multi-agent inference framework
Combines the complementary strengths of ALMs and LLMs
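The guide→infer→consolidate loop described above can be sketched as follows. This is a minimal, hypothetical illustration of the Gifts-style control flow, not the authors' implementation: the ALM and LLM calls are stubbed with canned outputs, and all function names (`llm_guide`, `alm_infer`, `llm_consolidate`) are invented for this sketch. The consistency analysis is reduced to a majority vote with an agreement score used to flag likely hallucinations.

```python
from collections import Counter

# Hypothetical sketch of a Gifts-style pipeline: an LLM guides an
# audio-language model (ALM) toward one sensitive attribute at a time,
# then forensically consolidates repeated ALM answers. Both models are
# stubbed here so the control flow is runnable.

ATTRIBUTES = ["age", "gender", "emotion"]

def llm_guide(attribute):
    """LLM agent: craft a focused prompt steering the ALM toward one attribute."""
    return f"Focus only on the speaker's {attribute}. Answer with a single label."

def alm_infer(audio_clip, prompt, trial):
    """ALM agent (stub): return an attribute label for the audio clip.
    A real ALM is queried several times; repeated runs may disagree
    because of hallucination in long-context generation."""
    canned = {
        "age":     ["adult", "adult", "teen"],
        "gender":  ["female", "female", "female"],
        "emotion": ["calm", "calm", "angry"],
    }
    attr = prompt.split("speaker's ")[1].split(".")[0]
    return canned[attr][trial % len(canned[attr])]

def llm_consolidate(candidates):
    """LLM agent: forensic consistency analysis, reduced here to a
    majority vote plus an agreement score flagging unstable answers."""
    label, count = Counter(candidates).most_common(1)[0]
    return {"label": label, "agreement": count / len(candidates)}

def profile(audio_clip, n_trials=3):
    """Run the guide -> infer -> consolidate loop for every attribute."""
    report = {}
    for attr in ATTRIBUTES:
        prompt = llm_guide(attr)
        answers = [alm_infer(audio_clip, prompt, t) for t in range(n_trials)]
        report[attr] = llm_consolidate(answers)
    return report

if __name__ == "__main__":
    print(profile("clip_001.wav"))
```

In this toy run the low agreement on "age" and "emotion" (2 of 3 trials) is exactly the signal the consolidating LLM would use to discount a possibly hallucinated ALM answer, while the unanimous "gender" inference would be kept with high confidence.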
Lixu Wang
Northwestern University
Machine Learning · Data Privacy
Kaixiang Yao
Huazhong University of Science and Technology, China
Xinfeng Li
Nanyang Technological University, Singapore
Dong Yang
The University of Tokyo, Japan
Haoyang Li
Hong Kong Polytechnic University, Hong Kong, China
Xiaofeng Wang
ACM, USA
Wei Dong
Nanyang Technological University, Singapore