🤖 AI Summary
Existing Large Concept Models (LCMs) trade semantic abstraction against generation quality because their likelihood-free training objectives, such as mean-squared-error or diffusion-based losses, sacrifice principled probabilistic modeling. To address this, we propose a decoder-only Transformer architecture that, for the first time, enables semantic-level autoregressive reasoning within a frozen SONAR sentence-embedding space: the model predicts sentence embeddings autoregressively, while gradients are backpropagated through a frozen SONAR decoder to optimize a token-level cross-entropy loss. This hybrid training paradigm preserves high-level semantic abstraction, eliminates inefficient diffusion sampling, and restores a likelihood-based optimization signal. The approach achieves strong performance across model scales from 39M to 1.3B parameters, and all code and checkpoints are publicly released to advance reproducible large-concept-model research.
📝 Abstract
The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
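The hybrid objective described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' released code: `ConceptLM` and `FrozenSonarDecoder` are hypothetical stand-ins (with toy dimensions) for the decoder-only transformer over sentence embeddings and the frozen SONAR decoder, respectively. The key point is that token-level cross-entropy on the decoded sentence flows backward through the frozen decoder into the concept model.

```python
import torch
import torch.nn as nn

EMB, VOCAB, SEQ = 16, 50, 6  # toy sizes; real SONAR embeddings are much larger

class ConceptLM(nn.Module):
    """Tiny stand-in for the decoder-only transformer over sentence embeddings."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB, nhead=4, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(EMB, EMB)

    def forward(self, sent_embs):  # (B, S, EMB): embeddings of previous sentences
        mask = nn.Transformer.generate_square_subsequent_mask(sent_embs.size(1))
        return self.head(self.core(sent_embs, mask=mask))  # predicted next embeddings

class FrozenSonarDecoder(nn.Module):
    """Stand-in for the frozen SONAR decoder: sentence embedding -> token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB, SEQ * VOCAB)

    def forward(self, emb):  # (B, EMB) -> (B, SEQ, VOCAB)
        return self.proj(emb).view(-1, SEQ, VOCAB)

concept_lm = ConceptLM()
sonar_dec = FrozenSonarDecoder()
for p in sonar_dec.parameters():  # the SONAR decoder stays frozen
    p.requires_grad_(False)

embs = torch.randn(2, 3, EMB)               # batch of 3-sentence contexts
tokens = torch.randint(0, VOCAB, (2, SEQ))  # gold tokens of the next sentence

pred_emb = concept_lm(embs)[:, -1]          # predict the next sentence embedding
logits = sonar_dec(pred_emb)                # decode it to per-token logits
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1))
loss.backward()                             # gradients reach concept_lm only
```

Because the decoder's parameters are frozen but its operations remain differentiable, the cross-entropy gradient passes through it and updates only the concept model, which is what replaces the MSE/diffusion objectives of the original LCM with a likelihood-based signal.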