🤖 AI Summary
This work addresses stability issues in nnAudio arising from TorchScript incompatibility, edge cases in inverse transforms, and dependency drift in modern PyTorch environments. We refactor the STFT and iSTFT implementations to eliminate dynamic state changes and standardize parameter handling, and we explicitly restrict reliable inverse-STFT reconstruction to uniformly spaced frequency bins, raising errors for unsupported configurations. Additionally, we ensure that the variable-Q transform (VQT) reduces to the constant-Q transform (CQT) when γ = 0 and resolve compatibility issues between the CFP (Combined Frequency and Periodicity) feature and recent SciPy versions. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite, giving a more reliable basis for differentiable, TorchScript-compatible audio feature extraction in both research and engineering applications.
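The γ = 0 property can be illustrated with a minimal sketch. It assumes the common variable-Q formulation in which bin k has bandwidth B_k = α·f_k + γ with α = 2^(1/b) − 1 for b bins per octave; the function names and parameters below are hypothetical and are not taken from the nnAudio codebase.

```python
import math


def cqt_filter_lengths(sr, fmin, n_bins, bins_per_octave):
    """Filter lengths (in samples) for a constant-Q filterbank.

    Assumed formulation: bandwidth B_k = alpha * f_k, length = ceil(sr / B_k).
    """
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    lengths = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)  # geometrically spaced centers
        lengths.append(math.ceil(sr / (alpha * f_k)))
    return lengths


def vqt_filter_lengths(sr, fmin, n_bins, bins_per_octave, gamma=0.0):
    """Filter lengths for a variable-Q filterbank.

    Assumed formulation: bandwidth B_k = alpha * f_k + gamma. A positive gamma
    widens the low-frequency filters, which shortens their time support.
    """
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    lengths = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)
        lengths.append(math.ceil(sr / (alpha * f_k + gamma)))
    return lengths


# With gamma = 0 the two filterbanks have identical lengths.
cqt = cqt_filter_lengths(22050, 32.70, 84, 12)
vqt_zero = vqt_filter_lengths(22050, 32.70, 84, 12, gamma=0.0)
vqt_pos = vqt_filter_lengths(22050, 32.70, 84, 12, gamma=20.0)
print(cqt == vqt_zero)       # identical when gamma = 0
print(vqt_pos[0] < cqt[0])   # lowest bin is shorter when gamma > 0
```

Under this assumption the VQT differs from the CQT only through the additive γ term, so setting γ = 0 leaves every filter length unchanged.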
📝 Abstract
nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for current PyTorch and scientific Python environments. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
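The fail-fast behavior for inversion might look roughly like the following sketch. It is an illustration only: the helper name, the constant, and the error message are hypothetical, and the value 'log' is used merely as an example of a non-uniform frequency scale.

```python
# Hypothetical guard, not the nnAudio implementation: only the uniform-bin
# setting named in the abstract (freq_scale='no') is treated as invertible.
SUPPORTED_INVERSE_SCALES = ("no",)


def check_inverse_supported(freq_scale: str) -> None:
    """Raise instead of silently returning a degraded reconstruction."""
    if freq_scale not in SUPPORTED_INVERSE_SCALES:
        raise RuntimeError(
            "Inverse STFT is only supported for freq_scale='no'; "
            f"got freq_scale={freq_scale!r}."
        )


check_inverse_supported("no")  # uniform bins: passes without error

try:
    check_inverse_supported("log")  # example of a non-uniform scale
except RuntimeError as err:
    print(f"rejected: {err}")
```

Raising at call time means a caller who requests inversion with a non-uniform scale sees the limitation immediately, rather than discovering a poor reconstruction later in the pipeline.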