Representation Matters in Randomized Smoothing for Audio Classification

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an ambiguity in randomized-smoothing robustness certification for audio classification: preprocessing steps such as normalization, range control, and feature transformation are often left unspecified, which obscures both the certified object and the perturbation model. The study compares randomized smoothing applied to the raw waveform, in feature space, and with post-noise processing such as clipping or peak normalization. It shows that the choice of audio representation substantially affects certification outcomes and proposes a “representation-aware” reporting protocol. Experiments on keyword spotting and environmental sound classification show that, at the same smoothing level (σ = 0.0025), datasets with different waveform energies have different SNR-equivalent scales (83.98 vs. 90.97 dB) even though they share the same median raw radius. On the environmental sound task, log-mel feature-level smoothing yields higher positive-radius certified accuracy (68.42% vs. 65.53%), although that certificate holds over features rather than waveforms. Clipping or peak normalization can change the effective perturbation norm by a factor of roughly 230–351.
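
To illustrate why post-noise processing matters, the following is a minimal sketch, not the paper's procedure. It assumes one plausible diagnostic, the ratio ‖post(x + δ) − post(x)‖₂ / ‖δ‖₂, and a hypothetical quiet test clip (a 440 Hz sine of amplitude 0.01 sampled at 16 kHz). The helper names `peak_normalize` and `effective_norm_ratio` are illustrative and do not come from the source, and the sketch does not attempt to reproduce the reported 230–351 range.

```python
import math
import random

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def peak_normalize(x):
    """Scale a waveform so its largest absolute sample equals 1."""
    peak = max(abs(a) for a in x)
    return [a / peak for a in x] if peak > 0 else list(x)

def effective_norm_ratio(x, delta, post):
    """Hypothetical diagnostic: ||post(x + delta) - post(x)|| / ||delta||."""
    noisy = [a + d for a, d in zip(x, delta)]
    diff = [p - q for p, q in zip(post(noisy), post(x))]
    return l2_norm(diff) / l2_norm(delta)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
sigma = 0.0025           # smoothing level quoted in the abstract
n = 16000                # assumed clip length (1 s at 16 kHz)

# Assumed quiet clip: 440 Hz sine with amplitude 0.01.
x = [0.01 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(n)]
delta = [rng.gauss(0.0, sigma) for _ in range(n)]

ratio_identity = effective_norm_ratio(x, delta, lambda v: list(v))
ratio_peak = effective_norm_ratio(x, delta, peak_normalize)

print(f"no post-processing: {ratio_identity:.2f}x")
print(f"peak normalization: {ratio_peak:.1f}x")
```

Without post-processing the ratio is 1 by construction. With peak normalization applied after the noise, the gain depends on the noisy peak, so the perturbation seen by the classifier is rescaled well beyond the injected noise norm. This is the kind of post-noise geometry change the summary refers to.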

📝 Abstract

Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined, because standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary: at the same smoothing level $\sigma=0.0025$, the two datasets share the same median raw radius $0.007996$, but different waveform energies yield different SNR-equivalent scales ($83.98$ vs. $90.97$ dB); log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds ($68.42\%$ vs. $65.53\%$), certifying more examples with nonzero radius but over features rather than waveforms; and clipping or peak normalization changes the effective perturbation norm by roughly $230$--$351\times$. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
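
The relationship between the smoothing level, the raw radius, and an SNR-style scale can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code. It assumes the commonly used Gaussian-smoothing bound R = σ·Φ⁻¹(p), where p is a lower bound on the top-class probability, and a hypothetical definition of the SNR-equivalent scale as 20·log₁₀(‖x‖₂ / R). The clip lengths (16,000 and 80,000 samples) and the unit-RMS gain are assumptions chosen for illustration; the source states neither its definition nor these values.

```python
import math
from statistics import NormalDist

def certified_radius(sigma, p_lower):
    """L2 radius sigma * Phi^{-1}(p_lower); 0.0 means abstain (assumed bound)."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

def snr_equivalent_db(signal_l2_norm, radius):
    """Hypothetical SNR-equivalent scale: 20 * log10(||x||_2 / R)."""
    return 20.0 * math.log10(signal_l2_norm / radius)

sigma = 0.0025        # smoothing level from the abstract
radius = 0.007996     # median raw radius from the abstract

# Top-class probability bound implied by the reported radius under the assumed bound.
p_implied = NormalDist().cdf(radius / sigma)
print(f"implied lower bound on top-class probability: {p_implied:.5f}")

# Assumed unit-RMS clips, so ||x||_2 = sqrt(number of samples).
for n_samples in (16000, 80000):
    norm = math.sqrt(n_samples)
    print(f"{n_samples} samples: {snr_equivalent_db(norm, radius):.2f} dB")
# 16000 samples: 83.98 dB
# 80000 samples: 90.97 dB
```

Under these assumptions, the same raw radius maps to different dB scales purely because the clips carry different total energy. This is why the abstract recommends reporting the gain policy alongside the raw radius.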

Problem

Research questions and friction points this paper is trying to address.

Randomized Smoothing

Audio Classification

Robustness Certification

Representation

Perturbation Model

Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomized Smoothing

Audio Classification

Representation Awareness