🤖 AI Summary
Existing audio Transformer models capture only pairwise acoustic relationships, limiting their ability to identify distinct sound objects—especially under low-shot conditions. To address this, we propose HG-GraphNet, a graph neural network for audio classification and tagging that jointly encodes local neighborhood graphs and high-order hypergraphs derived from fuzzy C-means clustering. Our approach is the first to integrate local binary relations with fuzzy-clustering-induced high-order cliques, eliminating the need for ImageNet pretraining. Evaluated on three public audio benchmarks, HG-GraphNet consistently outperforms Transformer baselines while reducing parameter count by 30–50%. Crucially, it achieves greater gains in low-resource settings, yielding an average +2.8% improvement in mean Average Precision (mAP). These results demonstrate the effectiveness and generalization advantage of explicitly modeling high-order acoustic object structures.
📝 Abstract
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local- Higher Order Graph Neural Network (LHGNN), a graph based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.