AI Summary
Existing foundational audio models (e.g., SSAST, EAT, HuBERT) suffer from limited generalizability and reusability due to fixed sampling rates and input-duration constraints. To address this, we propose AMAuT, the first from-scratch multi-view audio Transformer framework supporting arbitrary sampling rates and variable-length audio inputs. Our method introduces four key innovations: (1) an augmentation-driven learning paradigm; (2) a Conv1-Conv7-Conv1 bottleneck architecture for efficient spectral-temporal feature extraction; (3) a dual-token mechanism (CLS + TAL) capturing bidirectional contextual dependencies; and (4) test-time adaptation/augmentation (TTA²) for robust inference. AMAuT requires no pretraining, instead leveraging multi-view data augmentation and contrastive learning to enhance robustness. Evaluated on five public benchmarks, it achieves up to 99.8% accuracy while incurring less than 3% of the training cost of comparable pretrained models. This yields substantial improvements in flexibility, cross-dataset generalization, and feasibility for edge deployment.
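The Conv1-Conv7-Conv1 bottleneck mentioned above can be sketched as a plain 1-D convolution stack: a kernel-size-1 convolution squeezes the channel dimension, a kernel-size-7 convolution mixes temporal context cheaply in the narrow space, and a second kernel-size-1 convolution restores the channel width. The sketch below uses pure Python and illustrative channel widths and weights; it is an assumption about the general bottleneck pattern, not the paper's actual layer configuration (which also includes normalization and nonlinearities not shown here):

```python
def conv1d(x, w):
    """Valid 1-D cross-correlation, stride 1.
    x: list of C_in channels, each a list of T samples.
    w: list of C_out filters, each a list of C_in kernels of length K.
    Returns C_out channels of length T - K + 1."""
    c_in = len(x)
    k = len(w[0][0])
    t_out = len(x[0]) - k + 1
    return [[sum(w[o][i][j] * x[i][s + j]
                 for i in range(c_in) for j in range(k))
             for s in range(t_out)]
            for o in range(len(w))]

def bottleneck(x, w_reduce, w_temporal, w_expand):
    """conv1 -> conv7 -> conv1: squeeze channels, mix time, restore channels."""
    h = conv1d(x, w_reduce)     # 1x1 conv: channel reduction
    h = conv1d(h, w_temporal)   # kernel-7 conv: temporal context
    return conv1d(h, w_expand)  # 1x1 conv: channel expansion
```

With 4 input channels reduced to 2 inside the bottleneck, a length-20 input yields 4 output channels of length 14 (the kernel-7 stage trims 6 samples under valid padding).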
Abstract
Recent foundational models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited to fixed input sampling rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a train-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sampling rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multi-view learning for robustness; (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding; (3) dual CLS + TAL tokens for bidirectional context representation; and (4) test-time adaptation/augmentation (TTA²) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies of up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus offers a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
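The augmentation half of test-time adaptation/augmentation can be sketched as prediction averaging over several randomly augmented views of one test clip. This is a minimal illustration under assumptions: `model` and the augmentation functions are hypothetical placeholders, and the adaptation part of TTA² (adjusting the model at inference time) is not shown:

```python
import random

def tta_predict(model, x, augmentations, n_views=8):
    """Classify one input by averaging class probabilities over
    n_views randomly augmented copies, then taking the argmax.
    model: callable mapping a sample to a list of class probabilities.
    augmentations: list of callables, each producing one augmented view."""
    views = [random.choice(augmentations)(x) for _ in range(n_views)]
    preds = [model(v) for v in views]
    n_classes = len(preds[0])
    avg = [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

Averaging over views smooths out predictions that are sensitive to small perturbations such as time shifts or added noise, which is the reliability gain the abstract attributes to TTA² at inference time.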