🤖 AI Summary
Conventional CNNs for audio classification neglect inter-channel correlations, while quaternion CNNs (QCNNs) suffer from high computational complexity, hindering deployment. Method: This paper proposes a pruned quaternion neural network that jointly models multi-channel spectrogram features using quaternion algebra, integrated with knowledge distillation and structured pruning for model compression. Contribution/Results: The approach preserves QCNNs’ cross-channel joint representation capability while substantially reducing computational cost and parameter count—achieving 50% lower FLOPs and 80% fewer parameters on AudioSet, with significantly reduced inference latency and performance on par with baseline CNNs. Moreover, it generalizes well across multiple benchmarks—including GTZAN, ESC-50, and RAVDESS—offering a practical route to efficient audio understanding in resource-constrained settings.
📝 Abstract
Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to suboptimal feature learning, particularly for complex audio patterns such as multi-channel spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs) address this limitation by employing quaternion algebra to jointly capture inter-channel dependencies, enabling more compact models with fewer learnable parameters while better exploiting the multi-dimensional nature of audio signals. However, QCNNs exhibit higher computational complexity due to the overhead of quaternion operations, resulting in increased inference latency and reduced efficiency compared to conventional CNNs and posing challenges for deployment on resource-constrained platforms. To address this challenge, this study explores knowledge distillation (KD) and pruning to reduce the computational complexity of QCNNs while maintaining performance. Our experiments on audio classification reveal that pruning QCNNs achieves similar or superior performance compared to KD while requiring less computational effort. Compared to conventional CNNs and Transformer-based architectures, pruned QCNNs achieve competitive performance with a reduced learnable parameter count and computational complexity. On the AudioSet dataset, pruned QCNNs reduce computational cost by 50% and parameter count by 80%, while maintaining performance comparable to conventional CNNs. Furthermore, pruned QCNNs generalize well across multiple audio classification benchmarks, including GTZAN for music genre recognition, ESC-50 for environmental sound classification, and RAVDESS for speech emotion recognition.
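The structured pruning the abstract refers to removes whole convolution filters rather than individual weights, so both the parameter count and downstream FLOPs shrink. A common criterion is to drop the filters with the smallest L1 norms; the NumPy sketch below illustrates that idea only and is not the paper's procedure. The `prune_filters` helper and the 50% keep ratio are illustrative assumptions.

```python
import numpy as np

def prune_filters(weight, keep_ratio=0.5):
    """Structured pruning sketch: keep the output filters with the
    largest L1 norms and drop the rest.

    weight: (out_channels, in_channels, kh, kw) convolution kernel.
    Returns the pruned kernel and the indices of the kept filters.
    """
    # One L1 norm per output filter.
    norms = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    # Indices of the strongest filters, in their original order.
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4, 3, 3))  # toy conv kernel: 8 filters
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(pruned.shape)  # → (4, 4, 3, 3): half the filters removed
```

Because entire filters disappear, the next layer's input channel count drops in step, which is what makes the FLOP and latency savings materialize on real hardware without sparse-kernel support.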