🤖 AI Summary
Non-autoregressive language models are commonly evaluated with generative perplexity (gen-PPL), but this metric measures only how predictable samples are under a scoring model, not whether they are grammatical or semantically coherent. This work constructs zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText while producing clearly incoherent text, exposing a fundamental limitation of gen-PPL. To address this, the authors recommend direct evaluation based on distributional divergences between generated and reference text (such as KL and Jensen–Shannon divergence) together with scoring from pretrained autoregressive models. Re-benchmarking recent non-autoregressive models with these distribution-based metrics gives a more faithful assessment of generation quality in unconditional text generation than gen-PPL alone.
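As a rough illustration of what a distributional-divergence metric can look like, the sketch below computes KL and Jensen–Shannon divergence between the unigram token distributions of a generated corpus and a reference corpus. The choice of unigram statistics and the helper names (`unigram_dist`, `kl`, `js`) are assumptions made for this example; the summary does not specify which features the divergences are computed over.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    """Empirical unigram distribution of `tokens` over a fixed `vocab` order."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n for w in vocab]

def kl(p, q):
    """KL(p || q) in nats; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence in nats; symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy corpora (hypothetical data, for illustration only).
reference = "the cat sat on the mat".split()
generated = "the the the cat cat mat".split()
vocab = sorted(set(reference) | set(generated))
p_ref = unigram_dist(reference, vocab)
p_gen = unigram_dist(generated, vocab)
print(js(p_gen, p_ref))
```

Unlike KL, the Jensen–Shannon divergence stays finite when one corpus contains tokens the other lacks, which makes it easier to use on small samples.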
📝 Abstract
Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to autoregressive language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the exponentiated per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR model, not grammaticality or semantic coherence, and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
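To make the two quantities in the abstract concrete, here is a minimal sketch of gen-PPL and an empirical-entropy check. It assumes the per-token log-probabilities have already been obtained from the frozen AR scorer, that gen-PPL is aggregated over all tokens in the sample set, and that entropy is the unigram entropy of a sample's tokens in nats. The function names and these aggregation choices are assumptions for illustration, not details confirmed by the abstract.

```python
import math
from collections import Counter

def gen_ppl(token_logprobs):
    """Generative perplexity: exp of the mean per-token negative log-likelihood.

    `token_logprobs` is a list of sequences, each a list of natural-log
    probabilities assigned by the scoring model to the sampled tokens.
    """
    total = sum(lp for seq in token_logprobs for lp in seq)
    n_tokens = sum(len(seq) for seq in token_logprobs)
    return math.exp(-total / n_tokens)

def unigram_entropy(tokens):
    """Empirical unigram entropy (nats) of one sample's token sequence."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical scorer outputs: every token receives probability 0.25.
logprobs = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
print(gen_ppl(logprobs))

# A collapsed sample has zero entropy; a varied one does not.
print(unigram_entropy(["a", "a", "a", "a"]))
print(unigram_entropy(["a", "b", "a", "b"]))
```

Neither function looks at word order or meaning beyond what the scorer's probabilities encode, which is why a sample set can score well on both while still being incoherent.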