🤖 AI Summary
Existing spoken language models can perceive paralinguistic cues such as emotion and background noise, but they often disregard this information in open-ended dialogue, leaving a gap between perception and response generation. This work proposes ParaBridge, the first method that lets models act on paralinguistic signals on their own, without an inference-time scaffold and without human annotations or external rewards during training. ParaBridge uses on-policy self-distillation: the scaffold-free model generates its own responses, and a scaffolded view supplies full-vocabulary, token-level supervision that teaches it when paralinguistic cues should change the reply. On Qwen3-Omni-thinking, scaffold-free VoxSafeBench SAR improves from 14.6% to 40.3% and the EchoMind average rating rises from 3.27 to 3.92, with negligible degradation in general capabilities. The method also generalizes across tasks and to a different model backbone.
📝 Abstract
Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.
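The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not state the exact objective. The sketch assumes a per-token KL divergence from the scaffolded (teacher) view to the scaffold-free (student) view, averaged over the positions of a response that the student itself sampled. The function names `token_kl` and `self_distillation_loss`, the KL direction, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the full vocabulary at one position.

    Assumption: the KL direction is chosen for illustration only.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def self_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL along one student-sampled response.

    teacher_logits_seq: logits from the scaffolded view, scored on the
        student's own tokens (treated as fixed targets).
    student_logits_seq: logits from the scaffold-free view at the same positions.
    """
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must cover the same positions")
    if not student_logits_seq:
        raise ValueError("empty rollout")
    kls = [token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)]
    return sum(kls) / len(kls)


# Toy example: a 3-token vocabulary and a 2-token rollout (hypothetical values).
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(self_distillation_loss(teacher, student))
```

Because the targets are full distributions rather than a single sampled token, every vocabulary entry contributes to the loss at each position, which is what makes the supervision dense. In an actual training loop the teacher logits would be computed without gradient flow, which is a further assumption here.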