MOSS-Audio Technical Report

📅 2026-06-01

🤖 AI Summary
This work addresses unified multimodal audio understanding across speech, environmental sounds, and music through a suite of tasks including audio captioning, temporal question answering, timestamped transcription, and audio-based reasoning. The authors propose a unified audio-language model architecture that integrates a dedicated audio encoder, modality adapters, and a large language model. Key innovations include DeepStack cross-layer feature injection and a temporal token mechanism to enhance time-aware modeling, alongside an event-preserving audio annotation pipeline. Leveraging 12.5 Hz temporal encoding and a multi-stage post-training strategy, the model achieves strong performance across diverse general and speech-centric audio benchmarks. The study releases both Instruct and Thinking variants at 4B and 8B parameter scales.
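As a rough illustration of cross-layer feature injection, the sketch below adds features taken from several encoder depths to the decoder's hidden states at the audio-token positions, one depth per early decoder layer. The source only states that the decoder is exposed to acoustic information from multiple encoder depths; the additive form, the one-depth-per-layer schedule, and every name here (`deepstack_inject`, `run_decoder`) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def deepstack_inject(hidden: Sequence, audio_positions: List[int],
                     depth_feats: Sequence) -> Sequence:
    """Add one encoder depth's features to the hidden states at audio positions.

    Assumption: injection is a plain residual addition, and depth_feats has
    already been projected to the decoder's hidden size.
    """
    out = [list(v) for v in hidden]  # copy so the input is not mutated
    for pos, feat in zip(audio_positions, depth_feats):
        out[pos] = [h + f for h, f in zip(out[pos], feat)]
    return out


def run_decoder(hidden: Sequence, audio_positions: List[int],
                multi_depth_feats: List[Sequence],
                layers: List[Callable[[Sequence], Sequence]]) -> Sequence:
    """Run decoder layers, injecting depth i before layer i (hypothetical schedule)."""
    for i, layer in enumerate(layers):
        if i < len(multi_depth_feats):
            hidden = deepstack_inject(hidden, audio_positions, multi_depth_feats[i])
        hidden = layer(hidden)
    return hidden


# Toy example: one text token followed by two audio tokens, hidden size 2.
identity = lambda h: h  # stand-in for a real transformer layer
result = run_decoder(
    hidden=[[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    audio_positions=[1, 2],
    multi_depth_feats=[
        [[1.0, 2.0], [3.0, 4.0]],      # features from a shallow encoder depth
        [[10.0, 20.0], [30.0, 40.0]],  # features from a deeper encoder depth
    ],
    layers=[identity, identity, identity],
)
```

In this toy setup the text token is left untouched while each audio position accumulates the features from both encoder depths.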
📝 Abstract
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates text autoregressively. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising audio-understanding foundation for future voice agents.
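To make the time-marker idea concrete, here is a minimal sketch of interleaving timestamp markers with an audio-token stream at the 12.5 Hz frame rate stated above (one frame per 80 ms). The marker format `<t=...s>`, the 2-second spacing, and the function name `insert_time_markers` are hypothetical choices for illustration; the abstract does not specify how markers are formatted or how often they are inserted.

```python
from typing import List

FRAME_RATE_HZ = 12.5  # encoder frame rate given in the abstract


def insert_time_markers(audio_tokens: List[str],
                        interval_s: float = 2.0) -> List[str]:
    """Insert a timestamp marker before every block of `interval_s` seconds.

    Assumption: markers are plain text-like tokens placed in front of the
    audio frame that starts at the marked time.
    """
    frames_per_marker = int(round(interval_s * FRAME_RATE_HZ))
    if frames_per_marker < 1:
        raise ValueError("interval_s is shorter than one audio frame")

    out: List[str] = []
    for i, tok in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Frame index i starts at i / 12.5 seconds.
            out.append(f"<t={i / FRAME_RATE_HZ:.1f}s>")
        out.append(tok)
    return out


# 60 placeholder frames cover 60 / 12.5 = 4.8 seconds of audio.
frames = [f"a{i}" for i in range(60)]
stream = insert_time_markers(frames)
```

With a 2-second interval, markers land before frames 0, 25, and 50, so the 60-frame stream grows to 63 tokens.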
Problem

Research questions and friction points this paper is trying to address.

audio-language model
unified audio understanding
temporal grounding
audio captioning
time-aware reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepStack cross-layer feature injection
time markers
event-preserving audio annotation
audio-language pretraining
temporal grounding