🤖 AI Summary
This study addresses the unresolved question of how audio effects—such as reverb, distortion, modulation, and dynamic processing—systematically influence perceived musical emotion. We propose the first analysis framework for this question grounded in multimodal foundation models. Methodologically, we use large-scale audiovisual pretrained foundation models to extract rich music representations and apply structured probing techniques to disentangle the nonlinear associations between effect-specific acoustic features and affective dimensions (e.g., arousal and valence), while evaluating robustness across musical styles and production settings. Our key contributions are threefold: (1) identification of distinct, effect-specific modulation mechanisms over emotional dimensions; (2) empirical validation of foundation models as interpretable tools for mapping acoustic features to affective semantics; and (3) establishment of a data-driven paradigm for evaluating the emotional impact of sound design choices.
📝 Abstract
Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models, large-scale neural architectures pretrained on multimodal data, can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from these models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. Our findings aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.
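To make the probing idea concrete, below is a minimal sketch of a linear probe on frozen embeddings. The embeddings and valence targets here are synthetic stand-ins (the paper does not specify a particular model or probe); a real pipeline would replace `X` with per-clip vectors extracted from a pretrained audio foundation model and `y` with annotated arousal or valence ratings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for foundation-model embeddings of FX-processed
# clips: 500 clips x 128-dim frame-averaged vectors.
X = rng.normal(size=(500, 128))

# Synthetic valence targets that depend on a few embedding dimensions,
# so the probe has real signal to recover.
w = np.zeros(128)
w[:5] = [0.8, -0.6, 0.5, 0.4, -0.3]
y = X @ w + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The probe itself: a regularized linear map from frozen embeddings to
# the affective dimension. High held-out R^2 means that dimension is
# linearly decodable from the representation; nonlinear probes (e.g.
# small MLPs) can be swapped in to test for nonlinear associations.
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
score = r2_score(y_te, probe.predict(X_te))
print(f"held-out probe R^2: {score:.2f}")
```

Comparing probe scores on embeddings of the same clips before and after a given effect (e.g. heavy reverb) is one way to attribute shifts in predicted emotion to that effect.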