Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle to resolve conflicting information in document streams where multiple concurrent events are interleaved, and no dedicated benchmark evaluates them in this setting. To address this gap, the authors propose StreamBench, an evaluation benchmark for multi-event concurrent document streams, built from major news events in 2016 and 2025 and comprising 605 events and 15,354 documents. The benchmark covers three tasks: topic clustering, temporal question answering, and summarization. The study compares model performance with and without structural cues, which organize key facts by event, to measure how the cues affect information localization and event disentanglement. Structural cues improve topic clustering by up to 4.37% and temporal question answering by up to 9.63%, although temporal reasoning remains an open challenge.

📝 Abstract

Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.
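
The with/without comparison described in the abstract can be pictured with a small sketch. The abstract does not specify how structural cues are formatted or extracted, so everything below is an illustrative assumption: the `Doc` fields, the function names, the prompt layout, and the premise that each document already carries an event label and a one-line key fact.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """One document in the stream (all fields are hypothetical)."""
    date: str       # ISO date, e.g. "2016-06-24"
    event_id: str   # event label; assumed known here for illustration
    key_fact: str   # one-line key fact for this document
    text: str       # full document text


def plain_context(stream):
    """Baseline: documents in arrival order, with events interleaved."""
    return "\n".join(f"[{d.date}] {d.text}" for d in stream)


def cued_context(stream):
    """Structural-cue variant: key facts grouped by event, sorted by date."""
    by_event = defaultdict(list)
    for d in stream:
        by_event[d.event_id].append(d)
    blocks = []
    for event_id in sorted(by_event):
        # ISO dates sort correctly as plain strings.
        facts = sorted(by_event[event_id], key=lambda d: d.date)
        lines = [f"Event {event_id}:"]
        lines += [f"  - {d.date}: {d.key_fact}" for d in facts]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_prompt(stream, question, use_cues):
    """Assemble the model input for one condition of the comparison."""
    parts = []
    if use_cues:
        parts.append("Key facts by event:\n" + cued_context(stream))
    parts.append("Documents:\n" + plain_context(stream))
    parts.append("Question: " + question)
    return "\n\n".join(parts)
```

Calling `build_prompt(stream, question, use_cues=False)` and `build_prompt(stream, question, use_cues=True)` on the same stream yields the two conditions; the only difference is the event-grouped fact block prepended in the second case, which is one plausible way to help a model separate distinct events.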

Problem

Research questions and friction points this paper is trying to address.

streaming evaluation

document streams

concurrent events

language models

temporal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

structural cues

document streams

temporal reasoning