SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large concept models (LCMs) suffer from a trade-off between semantic abstraction and generation quality, primarily because their likelihood-free training objectives (mean squared error or diffusion-based losses) sacrifice principled probabilistic modeling. To address this, the authors propose a decoder-only Transformer that, for the first time, enables semantic-level autoregressive reasoning within a frozen SONAR sentence-embedding space: the model predicts sentence embeddings autoregressively, while gradients are backpropagated through a frozen SONAR decoder to optimize a token-level cross-entropy loss. This hybrid training paradigm preserves high-level semantic abstraction, eliminates inefficient diffusion sampling, and restores a likelihood-based optimization signal. The approach achieves strong performance across model scales, from 39M to 1.3B parameters, and all code and checkpoints are publicly released to advance reproducible large-concept modeling research.
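The core idea in the summary above can be sketched in a few lines of numpy: a trainable predictor outputs the next sentence embedding, a frozen decoder maps that embedding to token logits, and the token-level cross-entropy gradient flows back through the frozen decoder into the predictor only. This is a minimal illustrative sketch, not the authors' implementation; all shapes, layer choices, and names (`W_pred`, `W_dec`, a single linear layer in place of the Transformer and SONAR decoder) are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, vocab, seq = 8, 16, 4

# Trainable "concept" predictor: previous embedding -> next embedding.
# (Stands in for the decoder-only Transformer; illustrative only.)
W_pred = rng.normal(scale=0.1, size=(d_emb, d_emb))

# Frozen SONAR-like decoder: embedding -> per-position token logits.
# Never updated, but gradients still pass through it.
W_dec = rng.normal(scale=0.1, size=(d_emb, seq * vocab))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def step(prev_emb, target_tokens):
    """Return (loss, grad wrt W_pred); the decoder stays frozen."""
    pred_emb = prev_emb @ W_pred                     # predicted sentence embedding
    logits = (pred_emb @ W_dec).reshape(seq, vocab)  # frozen decode to token logits
    probs = softmax(logits)
    loss = -np.mean(np.log(probs[np.arange(seq), target_tokens]))
    # Token-level cross-entropy gradient, chained THROUGH the frozen decoder.
    dlogits = probs.copy()
    dlogits[np.arange(seq), target_tokens] -= 1.0
    dlogits /= seq
    demb = W_dec @ dlogits.reshape(-1)               # d loss / d pred_emb
    dW_pred = np.outer(prev_emb, demb)               # only the predictor learns
    return loss, dW_pred

prev_emb = rng.normal(size=d_emb)
targets = rng.integers(0, vocab, size=seq)

losses = []
for _ in range(50):                                  # a few SGD steps
    loss, grad = step(prev_emb, targets)
    losses.append(loss)
    W_pred -= 0.5 * grad                             # update predictor only
```

Running the loop drives the token-level loss down while `W_dec` is never touched, mirroring the paper's split between a trainable semantic predictor and a frozen decoder that merely routes the likelihood-based training signal.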

📝 Abstract
The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
Problem

Research questions and friction points this paper is trying to address.

Hybrid training for semantic and token-level text generation
Eliminating the diffusion sampler in autoregressive transformers
Improving scalability and quality in sentence embedding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses continuous SONAR embedding space
Hybrid token-level cross-entropy supervision
Eliminates the diffusion sampler for efficiency