MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency of conventional audio frontends, which typically apply a short-time Fourier transform (STFT) followed by Mel-scale filtering. This multi-stage pipeline maps poorly onto modern accelerators optimized for dense linear algebra and can incur memory-bandwidth, dispatch, and intermediate-allocation overheads. The authors propose MelT, which precomputes non-uniform discrete Fourier transform (NDFT) bases at Mel-spaced frequencies and reformulates the entire frontend as a single-stage, GEMM-native operation applied directly to time-domain audio frames. The contribution is this GEMM-native formulation and its evaluation, not the NDFT operator itself. Evaluated on hardware ranging from the Apple A18 Pro to the NVIDIA H100, MelT achieves up to a 3.75× speedup in inference latency and a 3.52× reduction in energy consumption while maintaining downstream classification accuracy.
📝 Abstract
Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
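To make the single-stage idea concrete, the following is a minimal NumPy sketch of a Mel-spaced NDFT projection computed with one dense matrix multiplication. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mel_ndft_basis`, `frame_signal`, `melt_like_frontend`), the HTK-style Mel formula, the Hann window folded into the basis, and the stacking of cosine and sine columns into one real-valued matrix are all choices made here for the example. The abstract does not specify these details.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; the paper may use another variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_ndft_basis(frame_len, n_bins, sample_rate, f_min=0.0, f_max=None):
    """Precompute a real-valued NDFT basis at Mel-spaced frequencies.

    Returns a (frame_len, 2 * n_bins) matrix whose first n_bins columns are
    windowed cosines (real part) and last n_bins columns are negated windowed
    sines (imaginary part), plus the bin centre frequencies in Hz.
    """
    f_max = sample_rate / 2.0 if f_max is None else f_max
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bins)
    freqs = mel_to_hz(mels)
    n = np.arange(frame_len)
    window = np.hanning(frame_len)  # analysis window folded into the basis
    phase = 2.0 * np.pi * np.outer(n, freqs) / sample_rate
    cos_part = window[:, None] * np.cos(phase)
    sin_part = -window[:, None] * np.sin(phase)
    return np.concatenate([cos_part, sin_part], axis=1), freqs

def frame_signal(x, frame_len, hop):
    # Slice a 1-D signal into overlapping frames: (n_frames, frame_len).
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def melt_like_frontend(x, basis, frame_len, hop):
    """Project time-domain frames onto the precomputed basis with one GEMM."""
    frames = frame_signal(x, frame_len, hop)
    proj = frames @ basis                 # the single dense matrix multiply
    n_bins = basis.shape[1] // 2
    re, im = proj[:, :n_bins], proj[:, n_bins:]
    return re ** 2 + im ** 2              # power per Mel-spaced frequency bin
```

In this sketch the basis is built once and reused for every input, so the per-utterance work reduces to framing plus one matrix product, with no intermediate linear-frequency spectrogram. Note that sampling the spectrum at Mel-spaced points is not numerically identical to aggregating STFT bins with triangular Mel filters; how the paper reconciles the two is not stated in the abstract.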
Problem

Research questions and friction points this paper is trying to address.

audio frontend
accelerator efficiency
STFT
Mel spectrogram
memory bandwidth overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

GEMM-native
NDFT
single-stage frontend
audio processing
hardware efficiency
🔎 Similar Papers