🤖 AI Summary
This work addresses the challenge that conventional speech models cannot jointly optimize performance and computational complexity during training, because architectural parameters such as layer sizes are non-differentiable. To overcome this limitation, the authors propose a reparameterization method based on feature noise injection, which for the first time enables end-to-end differentiable, dynamic adjustment of the model architecture during training. This approach facilitates simultaneous optimization of accuracy and computational cost (FLOP/s) without relying on post-hoc pruning or quantization. By integrating this differentiable architecture optimization with standard SGD training, the method significantly reduces computational overhead while maintaining strong performance on both voice activity detection and audio anti-spoofing tasks. The implementation has been made publicly available.
📝 Abstract
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.
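The abstract does not give the exact formulation, but the core idea it describes, making a layer's effective width a continuous quantity that SGD can optimize alongside a differentiable complexity surrogate, can be sketched roughly as follows. This is a minimal illustration under assumed design choices: the sigmoid per-channel gates, the Gaussian feature noise, and all names (`gated_layer`, `expected_flops`, `noise_std`) are hypothetical, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_layer(x, W, alpha, noise_std=0.1, training=True):
    """Linear layer whose effective width is controlled by soft channel gates.

    Each output channel c has a learnable logit alpha[c]; its gate
    g[c] = sigmoid(alpha[c]) scales the channel. During training, noise is
    injected on the features so the gate shapes a stochastic activation,
    keeping gradients w.r.t. alpha informative even though the eventual
    "number of active channels" is discrete.
    """
    h = x @ W                                   # (batch, channels)
    g = sigmoid(alpha)                          # soft per-channel gates in (0, 1)
    if training:
        # feature noise injection (assumed Gaussian here)
        h = h + noise_std * rng.standard_normal(h.shape)
    return g * h, g

def expected_flops(in_dim, alpha):
    """Differentiable surrogate for the layer's multiply-accumulate count:
    each gate contributes in proportion to how 'open' it is, so a complexity
    penalty lambda * expected_flops(...) can be added to the task loss."""
    return in_dim * sigmoid(alpha).sum()

# Toy forward pass: 8 inputs, 16 candidate output channels.
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16))
alpha = np.zeros(16)                            # gates start half-open
y, g = gated_layer(x, W, alpha)

full = expected_flops(8, np.full(16, 10.0))     # gates ~1 -> near-full cost
half = expected_flops(8, np.zeros(16))          # gates = 0.5 -> half cost
print(y.shape, round(half / full, 2))           # -> (4, 16) 0.5
```

In this sketch the trade-off is set by the weight on the complexity term: driving a gate's logit negative shrinks both that channel's contribution and the expected cost, so the width adjusts dynamically during training rather than being pruned post hoc.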