SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

📅 2026-06-05
🤖 AI Summary
This work addresses two limitations of large audio language models: the scarcity of high-quality annotated data and insufficient fine-grained time–frequency perception. The authors propose SpectCount, a method that fine-tunes the model on purposefully designed synthetic audio signals generated online, without requiring real audio recordings, human annotations, or pretrained generative models. SpectCount improves the model's time–frequency awareness and its performance on unseen auditory tasks spanning sound, music, and speech. The results suggest that purely synthetic signals can data-efficiently correct model weaknesses and improve generalization.
📝 Abstract
Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through a probing analysis of signal detectability, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these weaknesses, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech that were unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.
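The abstract does not specify how the synthetic signals are constructed, so the following is only an illustrative sketch of what an on-the-fly counting example could look like. It assumes, hypothetically, that one training item is a short clip of windowed sine bursts at random frequencies, paired with a question whose answer is the number of bursts. The function name `make_counting_example`, its parameters, the frequency range, and the question wording are all assumptions for illustration and are not taken from the paper.

```python
import math
import random


def make_counting_example(rng, sample_rate=16000, duration_s=2.0, max_events=5):
    """Return (waveform, question, answer) for one synthetic counting item.

    Hypothetical design: the clip is split into `max_events` equal time slots,
    and a random subset of slots each receives one Hann-windowed sine burst.
    """
    n_samples = int(sample_rate * duration_s)
    waveform = [0.0] * n_samples

    # The label is known by construction, so no human annotation is needed.
    n_events = rng.randint(1, max_events)

    # Choosing distinct slots guarantees that bursts never overlap in time.
    slot_len = n_samples // max_events
    burst_len = slot_len // 2
    slots = sorted(rng.sample(range(max_events), n_events))

    for slot in slots:
        freq_hz = rng.uniform(200.0, 4000.0)  # assumed frequency range
        start = slot * slot_len + (slot_len - burst_len) // 2
        for i in range(burst_len):
            # A Hann window fades the burst in and out to avoid clicks.
            window = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (burst_len - 1))
            tone = math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            waveform[start + i] = 0.8 * window * tone

    question = "How many tone bursts are in this audio?"
    return waveform, question, n_events
```

Calling `make_counting_example(random.Random(seed))` inside a data loader would yield a fresh clip and label at every step, which is one plausible reading of "generated on-the-fly". The actual SpectCount signals may differ substantially, for example by varying events along the frequency axis as well as in time.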
Problem

Research questions and friction points this paper is trying to address.

large audio language models
data scarcity
spectrotemporal perception
audio understanding
annotated audio data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectrotemporal Counting
synthetic audio signals
large audio language models
data-efficient fine-tuning
auditory understanding
Seonuk Kim
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
Yonghyeon Jun
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
Ju Yeon Kang
Seoul National University
deep learning, speech signal processing
Jimin Hong
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
Yoonhyeong Lee
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
Nam Soo Kim
Seoul National University, Department of Electrical and Computer Engineering