AI Summary
Existing foundational audio models (e.g., SSAST, EAT, HuBERT) suffer from limited generalizability and reusability due to fixed sampling rates and input-duration constraints. To address this, we propose AMAuT, the first from-scratch multi-view audio Transformer framework supporting arbitrary sampling rates and variable-length audio inputs. Our method introduces four key innovations: (1) an augmentation-driven learning paradigm; (2) a Conv1-Conv7-Conv1 bottleneck architecture for efficient spectral-temporal feature extraction; (3) a dual-token mechanism (CLS + TAL) capturing bidirectional contextual dependencies; and (4) test-time adaptation/augmentation (TTA²) for robust inference. AMAuT requires no pretraining, instead leveraging multi-view data augmentation and contrastive learning to enhance robustness. Evaluated on five public benchmarks, it achieves up to 99.8% accuracy while incurring less than 3% of the training cost of comparable pretrained models. This yields substantial improvements in flexibility, cross-dataset generalization, and feasibility for edge deployment.
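The Conv1-Conv7-Conv1 bottleneck mentioned above can be sketched as a plain 1-D convolution stack: a kernel-size-1 convolution squeezes the channel dimension, a kernel-size-7 convolution mixes temporal context cheaply in the narrow space, and a second kernel-size-1 convolution restores the channel width. The sketch below uses pure Python and illustrative channel widths and weights; it is an assumption about the general bottleneck pattern, not the paper's actual layer configuration (which also includes normalization and nonlinearities not shown here):

```python
def conv1d(x, w):
    """Valid 1-D cross-correlation, stride 1.
    x: list of C_in channels, each a list of T samples.
    w: list of C_out filters, each a list of C_in kernels of length K.
    Returns C_out channels of length T - K + 1."""
    c_in = len(x)
    k = len(w[0][0])
    t_out = len(x[0]) - k + 1
    return [[sum(w[o][i][j] * x[i][s + j]
                 for i in range(c_in) for j in range(k))
             for s in range(t_out)]
            for o in range(len(w))]

def bottleneck(x, w_reduce, w_temporal, w_expand):
    """conv1 -> conv7 -> conv1: squeeze channels, mix time, restore channels."""
    h = conv1d(x, w_reduce)     # 1x1 conv: channel reduction
    h = conv1d(h, w_temporal)   # kernel-7 conv: temporal context
    return conv1d(h, w_expand)  # 1x1 conv: channel expansion
```

With 4 input channels reduced to 2 inside the bottleneck, a length-20 input yields 4 output channels of length 14 (the kernel-7 stage trims 6 samples under valid padding).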
Abstract
Recent foundational models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited to fixed input sampling rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a train-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sampling rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multi-view learning for robustness; (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding; (3) dual CLS + TAL tokens for bidirectional context representation; and (4) test-time adaptation/augmentation (TTA²) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies of up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus offers a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
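The augmentation half of test-time adaptation/augmentation can be sketched as prediction averaging over several randomly augmented views of one test clip. This is a minimal illustration under assumptions: `model` and the augmentation functions are hypothetical placeholders, and the adaptation part of TTA² (adjusting the model at inference time) is not shown:

```python
import random

def tta_predict(model, x, augmentations, n_views=8):
    """Classify one input by averaging class probabilities over
    n_views randomly augmented copies, then taking the argmax.
    model: callable mapping a sample to a list of class probabilities.
    augmentations: list of callables, each producing one augmented view."""
    views = [random.choice(augmentations)(x) for _ in range(n_views)]
    preds = [model(v) for v in views]
    n_classes = len(preds[0])
    avg = [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

Averaging over views smooths out predictions that are sensitive to small perturbations such as time shifts or added noise, which is the reliability gain the abstract attributes to TTA² at inference time.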