🤖 AI Summary
This work addresses the “length collapse” phenomenon in large language models, in which coherence and controllability degrade significantly when generating open-ended texts longer than 2,000 words. To mitigate this, the authors propose the Interleaved Structural Chain-of-Thought (IS-CoT) framework, which introduces, for the first time, an internal mechanism that dynamically interleaves structured reasoning with text generation. By embedding a Plan-Write-Reflect loop, IS-CoT continuously adjusts its writing strategy without external intervention while maintaining global alignment. Using a high-quality dataset of interleaved reasoning trajectories constructed via multi-teacher distillation, the authors train IS-Writer-8B, which achieves state-of-the-art performance on benchmarks such as LongBench-Write (surpassing DeepSeek-V3.2 by 3.08 points) and demonstrates length compliance and coherence comparable to much larger closed-source models.
📝 Abstract
Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitations of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process itself, enabling continuous strategy adaptation and global alignment without external assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.
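The control flow of a Plan-Write-Reflect cycle can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract describes the cycle as embedded inside the model's own generation, whereas the loop below is an ordinary driver written only to show the ordering of the three stages. All names (`interleaved_generate`, `toy_plan`, `toy_write`, `toy_reflect`), the 500-word segment budget, and the word-count stopping rule are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Trace = List[Tuple[str, object]]


def interleaved_generate(
    target_words: int,
    plan: Callable[[int, str], int],
    write: Callable[[int], str],
    reflect: Callable[[List[str], int], str],
    max_cycles: int = 50,
) -> Tuple[str, Trace]:
    """Run plan -> write -> reflect cycles until the word target is met."""
    segments: List[str] = []
    trace: Trace = []
    notes = ""  # latest reflection, fed back into the next planning step
    for _ in range(max_cycles):
        written = sum(len(s.split()) for s in segments)
        if written >= target_words:
            break
        # Plan: choose a word budget for the next segment (hypothetical rule).
        budget = plan(target_words - written, notes)
        trace.append(("plan", budget))
        # Write: produce the next segment under that budget.
        segment = write(budget)
        segments.append(segment)
        trace.append(("write", len(segment.split())))
        # Reflect: summarize progress against the global target.
        notes = reflect(segments, target_words)
        trace.append(("reflect", notes))
    return " ".join(segments), trace


# Toy stand-ins for the model's behavior (assumptions, not from the paper).
def toy_plan(remaining: int, notes: str) -> int:
    return min(remaining, 500)  # cap each segment at 500 words


def toy_write(budget: int) -> str:
    return " ".join(["word"] * budget)


def toy_reflect(segments: List[str], target: int) -> str:
    written = sum(len(s.split()) for s in segments)
    return f"{written}/{target} words written"


text, trace = interleaved_generate(2200, toy_plan, toy_write, toy_reflect)
print(len(text.split()), "words in", len(trace) // 3, "cycles")
```

Because each planning step sees the remaining budget and the latest reflection, the final segment shrinks to fit the target instead of following a fixed up-front outline, which is the contrast with static hierarchical planning that the abstract draws.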