Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Non-autoregressive language models are commonly evaluated with generative perplexity (gen-PPL), but this metric captures only how predictable samples are to the scoring model, not grammatical correctness or semantic coherence. This work constructs a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy yet produce incoherent text, exposing a fundamental limitation of gen-PPL. To address this, the authors recommend evaluation based on distributional divergences between generated and reference text, such as KL and Jensen–Shannon divergence, together with scoring from pretrained autoregressive models. Re-benchmarking recent non-autoregressive models with these metrics gives a more faithful picture of generation quality in unconditional text generation than gen-PPL alone.
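As a rough illustration of what a distributional divergence between generated and reference text can look like, the sketch below computes the Jensen–Shannon divergence between unigram token distributions. The whitespace tokenizer, the unigram statistics, the toy corpora, and the helper names (unigram_dist, kl, js_divergence) are all assumptions made for this example; the summary does not specify which features or estimators the paper's evaluation suite actually uses.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Empirical unigram distribution over whitespace tokens (a simplifying assumption)."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def kl(p, q):
    """KL(p || q) in nats; requires q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; symmetric and bounded above by log(2)."""
    support = set(p) | set(q)
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy corpora, invented for illustration only.
reference = ["the cat sat on the mat", "the dog sat on the rug"]
generated = ["the the the on on sat", "mat mat the cat cat on"]

p_ref = unigram_dist(reference)
p_gen = unigram_dist(generated)

js_same = js_divergence(p_ref, p_ref)  # identical distributions
js_diff = js_divergence(p_ref, p_gen)  # mismatched distributions
```

Unlike gen-PPL, a quantity of this kind compares generated text against a reference corpus rather than asking how predictable the samples are, so it is zero only when the two distributions match on the chosen statistics.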

📝 Abstract

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives for language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence, and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
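To make the two quantities in the abstract concrete, here is a minimal sketch of gen-PPL and an empirical-entropy check. It assumes the per-token natural-log probabilities have already been obtained from a frozen AR scorer (that step is not shown), that gen-PPL is reported as the exponential of the mean per-token negative log-likelihood, and that the entropy guardrail is the unigram entropy of a sample's token IDs. The function names and the toy numbers are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def gen_ppl(token_logprobs):
    """exp(mean per-token NLL) over all samples.

    token_logprobs: list of samples, each a list of natural-log probabilities
    that the scoring model assigned to the tokens of that sample.
    """
    flat = [lp for sample in token_logprobs for lp in sample]
    return math.exp(-sum(flat) / len(flat))


def empirical_entropy(token_ids):
    """Unigram entropy (nats) of the tokens in one sample."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy illustration with made-up scorer outputs.
repetitive_ids = [7] * 8
repetitive_lp = [[math.log(0.9)] * 8]  # scorer finds every token highly predictable
varied_ids = list(range(8))
varied_lp = [[math.log(0.1)] * 8]      # scorer finds every token less predictable

ppl_repetitive = gen_ppl(repetitive_lp)
ppl_varied = gen_ppl(varied_lp)
ent_repetitive = empirical_entropy(repetitive_ids)
ent_varied = empirical_entropy(varied_ids)
```

In this toy case the repetitive sample gets the lower gen-PPL, and only the entropy check flags it. The abstract's argument goes further: samples can pass the entropy check and still be incoherent, because neither quantity compares the generated text with reference text.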

Problem

Research questions and friction points this paper is trying to address.

generative perplexity

non-autoregressive language models

text evaluation

distributional metrics

semantic coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative perplexity

distributional metrics

non-autoregressive language models