SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

📅 2026-06-05

🤖 AI Summary

This work addresses two limitations of large audio language models: the scarcity of high-quality annotated data and insufficient fine-grained time–frequency perception. The authors propose SpectCount, a fine-tuning method that generates purposefully designed synthetic audio signals online, without requiring real audio recordings, human annotations, or pretrained generative models. SpectCount improves the model’s time–frequency awareness and carries over to auditory tasks in sound, music, and speech that were unseen during fine-tuning. The results suggest that purely synthetic signals can correct targeted model weaknesses data-efficiently and improve generalization.

📝 Abstract

Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of high-quality annotated audio data remains a fundamental bottleneck for scaling. Through a probing analysis of signal detectability, we identify fine-grained spectrotemporal perceptual weaknesses in a foundation LALM. To address these weaknesses, we propose Spectrotemporal Counting (SpectCount), a data-efficient fine-tuning approach based on fully synthetic audio signals generated on-the-fly, without relying on real-world audio, annotations, or pretrained generative models. SpectCount not only resolves the observed weaknesses but also improves performance on diverse auditory benchmarks spanning sound, music, and speech, unseen during fine-tuning. These results suggest that weakness-targeted synthetic signals provide a data-efficient path toward enhanced auditory understanding capabilities in LALMs.
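To make the idea of on-the-fly synthetic counting data concrete, here is a minimal sketch in plain Python. It is not the authors' signal design, which this summary does not specify. The function name `make_counting_example`, the use of Hann-windowed sine bursts, the frequency range, the durations, and the question wording are all assumptions chosen for illustration. The point it shows is that the label (the count) is known by construction, so no recording or human annotation is needed.

```python
import math
import random


def make_counting_example(rng, sample_rate=16000, duration_s=2.0,
                          burst_s=0.1, max_events=5):
    """Return (samples, question, answer) for one synthetic counting item.

    Hypothetical illustration only: signal type and parameters are assumed,
    not taken from the paper.
    """
    total = int(sample_rate * duration_s)
    burst = int(sample_rate * burst_s)
    count = rng.randint(1, max_events)  # ground-truth label, known by construction
    samples = [0.0] * total
    slot = total // count  # one burst per time slot, so bursts never overlap
    for i in range(count):
        freq = rng.uniform(200.0, 4000.0)  # kept below Nyquist (sample_rate / 2)
        start = i * slot + rng.randint(0, slot - burst)
        for n in range(burst):
            # Hann window tapers the burst edges to avoid clicks.
            window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (burst - 1))
            tone = math.sin(2.0 * math.pi * freq * n / sample_rate)
            samples[start + n] = 0.8 * window * tone
    question = "How many tone bursts do you hear?"
    return samples, question, str(count)


# Each call yields a fresh (audio, question, answer) triple for fine-tuning.
rng = random.Random(0)
samples, question, answer = make_counting_example(rng)
```

Because every item is generated from a seeded random source, a training loop could call such a function per step instead of reading a stored dataset. A real setup would also need to convert the samples into whatever input format the audio encoder expects.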

Problem

Research questions and friction points this paper is trying to address.

large audio language models

data scarcity

spectrotemporal perception

audio understanding

annotated audio data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectrotemporal Counting

synthetic audio signals

large audio language models