Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of maintaining consistency in long-form story generation by large language models, which often exhibit inconsistencies in facts, character attributes, and world rules. To this end, the authors introduce ConStory-Bench, a novel evaluation benchmark, along with ConStory-Checker, an automated detection tool. They establish the first fine-grained taxonomy of consistency errors, encompassing five major categories and nineteen subtypes, and propose an interpretable, evidence-based contradiction detection pipeline that integrates entropy analysis and multi-scenario prompt engineering. Experimental results reveal that consistency errors predominantly occur in factual and temporal dimensions, are most frequent in the middle segments of narratives, and correlate significantly with high-entropy text; furthermore, certain error types demonstrate notable co-occurrence patterns.

📝 Abstract
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
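One finding above is that consistency errors cluster in text segments with higher token-level entropy. The paper's own detection pipeline is not reproduced here, but the underlying measurement can be illustrated with a minimal sketch: compute the Shannon entropy of the token distribution in sliding windows over a story, so that repetitive passages score low and lexically varied passages score high. The `window_entropy` helper, the whitespace tokenization, and the window sizes are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter

def window_entropy(tokens, window=50, stride=25):
    """Shannon entropy (bits per token) of the token distribution
    in each sliding window over a token sequence.

    Returns a list of (window_start_index, entropy) pairs.
    """
    scores = []
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[start:start + window]
        total = len(chunk)
        counts = Counter(chunk)
        # H = -sum(p * log2 p) over the empirical token distribution.
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append((start, h))
    return scores

# Toy illustration: a repetitive segment scores lower entropy
# than a lexically varied one of the same length.
story = ("the knight rode north " * 10
         + "suddenly a storm of silver ravens broke over "
         + "the frozen citadel walls at midnight ").split()
for start, h in window_entropy(story, window=20, stride=20):
    print(start, round(h, 2))
```

A real study would use the model's token-level log-probabilities rather than empirical word frequencies, but the windowed-entropy framing is the same: score each segment, then test whether detected contradictions fall disproportionately in high-scoring windows.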
Problem

Research questions and friction points this paper is trying to address.

consistency bugs
long story generation
narrative consistency
large language models
storytelling
Innovation

Methods, ideas, or system contributions that make the work stand out.

narrative consistency
long-form story generation
consistency benchmark
automated contradiction detection
large language models