AI Summary
This work proposes Mel-LLM, an encoder-free architecture in which a large language model (LLM) reads lightly preprocessed mel-spectrograms directly, giving a single end-to-end model for speech and language. Spectrograms are split into patches and mapped into the LLM input space by a linear projection, so no dedicated speech encoder is required. For automatic speech recognition (ASR), Mel-LLM approaches the performance of encoder-equipped models, and initialization from a multimodal checkpoint (Phi-4-MM) is crucial when training data is limited. Preliminary text-to-speech (TTS) results with a next-token variational autoencoder (VAE) indicate that the unified design is feasible, and ablation studies identify which LLM layers are less relevant to speech encoding.
Abstract
Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations that an LLM can consume. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and in production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
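The patch-and-project input path described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The number of Mel bins, the patch length, the hidden size, the zero-padding of the last patch, and the helper names `patchify_mel` and `project_patches` are all assumptions chosen for illustration, and the projection weights are random rather than learned.

```python
import numpy as np

def patchify_mel(mel, patch_frames):
    """Group consecutive Mel frames into flat patches.

    mel: array of shape (num_frames, num_mel_bins)
    returns: array of shape (num_patches, patch_frames * num_mel_bins)
    """
    num_frames, num_bins = mel.shape
    # Zero-pad the time axis so it divides evenly into patches
    # (an assumption; the abstract does not say how edges are handled).
    pad = (-num_frames) % patch_frames
    if pad:
        mel = np.pad(mel, ((0, pad), (0, 0)))
    # Each row now holds patch_frames consecutive frames, concatenated.
    return mel.reshape(-1, patch_frames * num_bins)

def project_patches(patches, weight, bias):
    """Single linear projection from flat patches to LLM-sized embeddings."""
    return patches @ weight + bias

# Hypothetical sizes, not taken from the paper.
NUM_MEL_BINS = 80
PATCH_FRAMES = 4
HIDDEN_SIZE = 256

rng = np.random.default_rng(0)
mel = rng.standard_normal((103, NUM_MEL_BINS)).astype(np.float32)

# Random stand-ins for parameters that would be learned during training.
weight = (0.02 * rng.standard_normal(
    (PATCH_FRAMES * NUM_MEL_BINS, HIDDEN_SIZE))).astype(np.float32)
bias = np.zeros(HIDDEN_SIZE, dtype=np.float32)

patches = patchify_mel(mel, PATCH_FRAMES)
speech_embeddings = project_patches(patches, weight, bias)
print(patches.shape, speech_embeddings.shape)
```

In a setup like this, the rows of `speech_embeddings` would be placed in the LLM input sequence alongside text token embeddings, so any speech-text alignment has to be learned by the LLM layers themselves rather than by a separate encoder.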