AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch

📅 2025-10-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing foundational audio models (e.g., SSAST, EAT, HuBERT) suffer from limited generalizability and reusability due to fixed sampling rates and input duration constraints. To address this, we propose AMAuT, the first scratch-trained multi-view audio Transformer framework supporting arbitrary sampling rates and variable-length audio inputs. Our method introduces four key innovations: (1) an augmentation-driven learning paradigm; (2) a Conv1-Conv7-Conv1 bottleneck architecture for efficient spectral-temporal feature extraction; (3) a dual-token mechanism (CLS + TAL) capturing bidirectional contextual dependencies; and (4) test-time adaptive augmentation (TTA²) for robust inference. AMAuT requires no pretraining, instead leveraging multi-view data augmentation and contrastive learning to enhance robustness. Evaluated on five public benchmarks, it achieves up to 99.8% accuracy while incurring less than 3% of the training cost of comparable pretrained models. This yields substantial improvements in flexibility, cross-dataset generalization, and feasibility for edge deployment.
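
To make the bottleneck concrete, below is a minimal PyTorch sketch of a conv1 + conv7 + conv1 one-dimensional block, assuming a channel-reduce / temporal-conv / channel-expand structure. The class name, channel widths, normalization, and activation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Conv1Conv7Conv1Bottleneck(nn.Module):
    """Hypothetical reconstruction of the conv1 + conv7 + conv1 block:
    pointwise reduce -> kernel-7 temporal conv -> pointwise expand."""
    def __init__(self, in_channels: int, hidden_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=1),  # reduce channels
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, hidden_channels,
                      kernel_size=7, padding=3),                     # temporal context
            nn.BatchNorm1d(hidden_channels),
            nn.GELU(),
            nn.Conv1d(hidden_channels, out_channels, kernel_size=1), # expand to embed dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, time), e.g. mel bins as channels; time stays variable
        return self.block(x)

# Example: map 128 mel bins to 768-dim frame embeddings of arbitrary length
tokens = Conv1Conv7Conv1Bottleneck(128, 256, 768)(torch.randn(2, 128, 407))
print(tokens.shape)  # torch.Size([2, 768, 407])
```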

📝 Abstract
Recent foundational models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited by fixed input sample rates and durations, which hinders their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness; (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding; (3) dual CLS + TAL tokens for bidirectional context representation; and (4) test-time adaptation/augmentation (TTA²) to improve inference reliability. Experiments on five public benchmarks (AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus offers a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
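
The dual CLS + TAL token idea can be sketched as follows: a learnable token is prepended (CLS) and another appended (TAL) to the frame sequence, and the classifier reads both summary positions so context is gathered from both ends. This is a hedged reconstruction; the layer sizes, the concatenation head, and all identifiers are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DualTokenEncoder(nn.Module):
    """Illustrative dual-token Transformer: CLS prepended, TAL appended,
    both summary tokens fused for classification. Sizes are assumptions."""
    def __init__(self, embed_dim: int = 768, num_layers: int = 4, num_classes: int = 10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.tal_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(2 * embed_dim, num_classes)  # fuse both summary tokens

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, embed_dim) — e.g. CNN bottleneck output, transposed
        b = frames.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1),
                       frames,
                       self.tal_token.expand(b, -1, -1)], dim=1)
        x = self.encoder(x)
        # read the first (CLS) and last (TAL) positions for bidirectional context
        return self.head(torch.cat([x[:, 0], x[:, -1]], dim=1))

logits = DualTokenEncoder()(torch.randn(2, 407, 768))
print(logits.shape)  # torch.Size([2, 10])
```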
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of fixed input rates and durations in audio models
Eliminates dependency on pre-trained weights for audio classification
Provides efficient audio processing for computationally constrained settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training from scratch without pre-trained weights dependency
Supporting arbitrary audio sample rates and lengths (see the preprocessing sketch after this list)
Using one-dimensional CNN bottleneck for temporal encoding
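
A sketch of how arbitrary sample rates and lengths might be accommodated in preprocessing: compute per-clip log-mel features with a rate-dependent hop so each frame spans roughly 10 ms regardless of the source rate, leaving the time axis variable for the downstream CNN bottleneck and Transformer. The torchaudio pipeline and parameter choices here are assumptions for illustration, not the paper's documented front end.

```python
import torch
import torchaudio

def to_model_input(waveform: torch.Tensor, sample_rate: int, n_mels: int = 128) -> torch.Tensor:
    """Convert any-rate, any-length audio into (channels, n_mels, time) log-mel
    features. Hypothetical preprocessing: the time axis stays variable."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,        # set per clip, so no fixed-rate assumption
        n_fft=1024,
        hop_length=sample_rate // 100,  # ~10 ms frames regardless of rate
        n_mels=n_mels,
    )(waveform)
    return torch.log(mel + 1e-6)        # log compression for stability

feats_16k = to_model_input(torch.randn(1, 16000), 16000)    # 1 s at 16 kHz
feats_44k = to_model_input(torch.randn(1, 132300), 44100)   # 3 s at 44.1 kHz
print(feats_16k.shape, feats_44k.shape)  # both (1, 128, T) with different T
```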
Weichuang Shao
School of Computer and Mathematical Sciences, University of Nottingham Malaysia, Semenyih, Malaysia
Iman Yi Liao
University of Nottingham Malaysia Campus
Computer Vision, Image Processing, Machine Learning, Computer Graphics
Tomas Henrique Bode Maul
School of Computer and Mathematical Sciences, University of Nottingham Malaysia, Semenyih, Malaysia
Tissa Chandesa
Assistant Professor, School of Computer Science, University of Nottingham Malaysia
Image Processing, Computer Vision, Deep & Machine Learning, Generative AI