🤖 AI Summary
This study addresses key challenges in real-world speech prosody collection: semantic confounding, privacy risks, and low participant compliance. It introduces a field-deployable protocol, presented as the first to ensure content control, privacy preservation, and scalability simultaneously. The approach uses standardized scripts calibrated for emotional valence to constrain semantic content, extracts prosodic features in real time through on-device smartphone processing, and discards raw audio immediately after feature derivation, uploading only the anonymized features. In a large-scale deployment with 560 participants, the protocol yielded 9,877 high-quality recordings. These data supported accurate prediction of speaker sex and momentary affective states (valence and arousal), validating both the data quality and the practical utility of the framework.
📝 Abstract
Collecting everyday speech data for prosodic analysis is challenging due to the confounding of prosody and semantics, privacy constraints, and participant compliance. We introduce and empirically evaluate a content-controlled, privacy-first smartphone protocol that uses scripted read-aloud sentences to standardize lexical content (including prompt valence) while capturing natural variation in prosodic delivery. The protocol performs on-device prosodic feature extraction, deletes raw audio immediately, and transmits only derived features for analysis. We deployed the protocol in a large study (N = 560; 9,877 recordings), evaluated compliance and data quality, and conducted diagnostic prediction tasks on the extracted features: predicting speaker sex and concurrently reported momentary affective states (valence, arousal). We discuss implications and directions for advancing and deploying the protocol.
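The core privacy mechanism described above (extract prosodic features on device, then discard the raw waveform so only derived features are ever transmitted) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature set (duration, RMS energy, zero-crossing rate, an autocorrelation-based F0 estimate) and all function names are assumptions chosen for demonstration; the study's actual features and on-device code are not specified in the abstract.

```python
import math

def extract_prosodic_features(samples, sample_rate):
    """Derive a small, illustrative set of prosodic features from raw audio.

    The exact feature set in the paper is not given in the abstract; these
    four are common prosody proxies used here only as placeholders.
    """
    n = len(samples)
    duration_s = n / sample_rate
    # RMS energy: a rough proxy for vocal intensity
    rms = math.sqrt(sum(x * x for x in samples) / n)
    # Zero-crossing rate: crude voicing/noisiness indicator
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / (n - 1)
    # Crude F0 estimate: autocorrelation peak within ~75-400 Hz
    lo, hi = sample_rate // 400, sample_rate // 75
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, hi + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    f0_hz = sample_rate / best_lag
    return {"duration_s": duration_s, "rms": rms, "zcr": zcr, "f0_hz": f0_hz}

def on_device_pipeline(samples, sample_rate):
    """Extract features, then drop the raw audio before anything is uploaded."""
    features = extract_prosodic_features(samples, sample_rate)
    del samples  # raw waveform never leaves the device
    return features  # only anonymized derived features are transmitted

# Synthetic 150 Hz tone standing in for one scripted read-aloud recording
sr = 8000
wave = [math.sin(2 * math.pi * 150 * t / sr) for t in range(sr)]
feats = on_device_pipeline(wave, sr)
```

On real deployments the same shape applies: the uploaded payload is the feature dictionary alone, so downstream models (e.g., for sex or valence/arousal prediction) operate without access to the recoverable speech content.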