LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

2026-06-08
Citations: 0
Influential: 0
AI Summary
This work proposes Mel-LLM, an encoder-free architecture in which a large language model (LLM) reads lightly preprocessed mel-spectrograms directly, without a dedicated speech encoder, giving a unified end-to-end speech-language model. The spectrogram is split into patches, and each patch is mapped into the LLM input space by a linear projection. For automatic speech recognition (ASR), Mel-LLM performs close to encoder-initialized models, and initialization from a multimodal checkpoint (Phi-4-MM) is crucial when data is limited. Ablation studies identify which LLM layers are less relevant to speech encoding. Preliminary text-to-speech (TTS) results with a next-token variational autoencoder (VAE) are not yet optimal, but they indicate that a fully encoder-free autoregressive speech-text architecture is feasible.
๐Ÿ“ Abstract
Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations consumable by the LLM. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and in production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
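The patch-and-project input path described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the patch length, number of Mel bins, model width, random weights, and the function names `patchify` and `linear_project` are all hypothetical, and dropping trailing frames that do not fill a patch is an assumption made here for simplicity.

```python
import random


def patchify(mel, patch_frames):
    """Group consecutive frames into non-overlapping patches and flatten each.

    mel: list of T frames, each a list of n_mels floats.
    Trailing frames that do not fill a whole patch are dropped (assumption).
    """
    n_patches = len(mel) // patch_frames
    patches = []
    for p in range(n_patches):
        chunk = mel[p * patch_frames:(p + 1) * patch_frames]
        patches.append([value for frame in chunk for value in frame])
    return patches


def linear_project(patches, weight, bias):
    """Map each flattened patch x to a d_model-sized vector y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Hypothetical sizes, chosen only for illustration.
n_mels, patch_frames, d_model, n_frames = 80, 4, 16, 102

rng = random.Random(0)
# Stand-in for a lightly pre-processed Mel spectrogram (random values).
mel = [[rng.gauss(0.0, 1.0) for _ in range(n_mels)] for _ in range(n_frames)]

patch_dim = n_mels * patch_frames
weight = [[rng.gauss(0.0, 0.02) for _ in range(patch_dim)] for _ in range(d_model)]
bias = [0.0] * d_model

# One embedding per patch; in an encoder-free setup these would sit in the
# LLM input sequence alongside the text token embeddings.
embeddings = linear_project(patchify(mel, patch_frames), weight, bias)
print(len(embeddings), len(embeddings[0]))
```

In this sketch the only speech-specific trainable parameters are `weight` and `bias`; everything else needed for speech-text alignment would have to be learned by the LLM itself, which is the point of the encoder-free design.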
Problem

Research questions and friction points this paper is trying to address.

speech-language modeling
encoder-free
Mel spectrogram
large language models
speech representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

encoder-free
speech-language modeling
Mel spectrogram
large language model
autoregressive modeling