LALM-Eval: An Open-Source Toolkit for Holistic Evaluation of Large Audio Language Models

📅 2025-09-09
🤖 AI Summary
Current large audio-language models (LALMs) face three critical evaluation bottlenecks: low efficiency hindering large-scale experimentation, inconsistent prompt design undermining reproducibility, and narrow task coverage—particularly lacking benchmarks for speaker diarization and spoken-language reasoning. This paper introduces an efficient, open-source LALM evaluation toolkit. It features a standardized prompt protocol, LLM-adaptive task configuration, and a parallel batch-processing architecture, achieving up to a 127% speedup in evaluation throughput. Crucially, it introduces two novel, temporally sensitive evaluation dimensions: speaker diarization and multi-step spoken-language reasoning—exposing systematic deficiencies in complex instruction following and temporal understanding across state-of-the-art models. Validated on 380+ tasks, the framework reveals that modality-inconsistent instruction formulation can induce up to a 9.5-percentage-point performance gap, establishing a robust, fine-grained benchmark for LALM capability assessment and iterative development.

📝 Abstract
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce LALM-Eval, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning. Our findings also highlight a lack of standardization in instruction modality across audio benchmarks, which can lead to performance differences of up to 9.5 absolute points on challenging complex instruction-following downstream tasks. LALM-Eval provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
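The abstract's point about standardized prompting can be made concrete with a small sketch. The names below (PromptSpec, render) are illustrative assumptions, not LALM-Eval's actual API: the idea is simply that each task fixes its instruction wording and instruction modality once, so every model is queried identically.

```python
# Hypothetical sketch of a standardized prompt protocol for LALM evaluation.
# PromptSpec and render are illustrative names, not the toolkit's real API.
from dataclasses import dataclass


@dataclass
class PromptSpec:
    task: str                  # e.g. "speaker_diarization"
    instruction: str           # fixed wording shared by all evaluated models
    instruction_modality: str  # "text" or "audio"; held constant per task


def render(spec: PromptSpec, audio_path: str) -> dict:
    """Build one evaluation request with a fixed, reproducible prompt."""
    return {
        "task": spec.task,
        "audio": audio_path,
        "prompt": spec.instruction,
        "instruction_modality": spec.instruction_modality,
    }


spec = PromptSpec(
    task="speaker_diarization",
    instruction="List each speaker turn with its start and end time.",
    instruction_modality="text",
)
request = render(spec, "sample_001.wav")
```

Pinning the modality in the spec reflects the paper's finding that mixing text and spoken instructions across benchmarks can shift scores by up to 9.5 absolute points.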
Problem

Research questions and friction points this paper is trying to address.

Evaluating large audio language models is hampered by inefficient evaluation toolkits
Current frameworks suffer from slow processing and inconsistent prompting
Narrow task coverage misses key audio reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized batch processing and parallel execution for up to 127% speedup
Standardized prompting protocols for fair model comparison
New evaluation categories for diarization and spoken-language reasoning
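The first contribution above, batched parallel evaluation, can be sketched in a few lines. This is a minimal illustration of the general technique, assuming a placeholder scorer; it is not the toolkit's implementation.

```python
# Hypothetical sketch of batched, parallel evaluation; not LALM-Eval's code.
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Split items into consecutive batches of at most batch_size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def evaluate_batch(batch):
    # Placeholder scorer: a real toolkit would send the batch to a
    # model endpoint here and parse its responses into scores.
    return [len(sample) for sample in batch]


def run_parallel(samples, batch_size=4, workers=2):
    # pool.map preserves input order, so scores line up with samples.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(evaluate_batch, batched(samples, batch_size))
    return [score for batch_scores in results for score in batch_scores]


scores = run_parallel(["a", "bb", "ccc", "dddd", "ee"])
```

Batching amortizes per-request overhead while the thread pool overlaps waiting on multiple in-flight batches, which is where the throughput gain over sequential per-sample evaluation comes from.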
Sidharth Surapaneni
University of Texas at Austin
Hoang Nguyen
ServiceNow
Jash Mehta
ServiceNow
Aman Tiwari
ServiceNow
Oluwanifemi Bamgbose
University of Waterloo
Akshay Kalkunte
ServiceNow
Sai Rajeswar
Staff Research Scientist, Adjunct Professor, Mila, ServiceNow
machine learning · generative models · reinforcement learning
Sathwik Tejaswi Madhusudhan
ServiceNow