🤖 AI Summary
This work addresses streaming query evaluation over large JSON/XML documents, where answers must be returned as early as possible, without building a full parse tree, while keeping memory consumption low. The authors consider queries that select nodes by path and by filter conditions, as in XPath and JSONPath, where a filter may depend on parts of the document that arrive later in the stream. They show that every unary query expressible in monadic second-order (MSO) logic admits earliest query answering with constant update time, provided that answers are returned by passing an iterator rather than one node at a time. Under this scheme, a node is returned as soon as it is guaranteed to be an answer and discarded as soon as it can no longer become one.
📝 Abstract
Streaming allows executing queries over massive JSON or XML documents whose size makes it infeasible to fully parse them into a tree. Earliest query answering is a radical approach to reducing latency and memory footprint. To minimize latency, a document node must be returned as soon as it is guaranteed to be an answer regardless of how the document ends. Similarly, to minimize memory footprint, a node must be discarded as soon as it can no longer become an answer regardless of how the document ends. For simple queries that select nodes based on the path from the root, the decision for each node can be made on the spot, but practical languages such as XPath or JSONPath support filters, which allow selecting nodes based on information collected from various parts of the document, possibly further down the stream. This makes earliest query answering a challenging task, as candidate nodes must be kept in memory until it becomes clear that they can be safely returned or discarded. We show that this can be done for all unary queries expressible in monadic second-order logic (MSO), while ensuring constant update time, provided that nodes are returned by passing a suitable iterator, rather than one by one.
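To make the notion of earliest answering concrete, here is a minimal Python sketch for one hand-coded filter query, roughly the XPath `//item[ok]` (select every `item` element that has an `ok` child). It is not the MSO construction from the paper and does not use its iterator mechanism. The event format, the function name `earliest_items_with_ok`, and the use of the open-event index as a node identifier are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumed event format: ("open" | "close", label), one event per tag.
Event = Tuple[str, str]


def earliest_items_with_ok(events: Iterable[Event]) -> Iterator[Tuple[int, int]]:
    """Yield (node_id, position) for each `item` element with an `ok` child.

    node_id is the index of the node's open event; position is the index of
    the event at which the node became a guaranteed answer.
    """
    # One stack entry per currently open element: (node_id, label, decided).
    stack: List[Tuple[int, str, bool]] = []
    for pos, (kind, label) in enumerate(events):
        if kind == "open":
            if label == "ok" and stack:
                parent_id, parent_label, decided = stack[-1]
                if parent_label == "item" and not decided:
                    # The parent is an answer however the document ends,
                    # so it is returned now rather than at its closing tag.
                    stack[-1] = (parent_id, parent_label, True)
                    yield parent_id, pos
            stack.append((pos, label, False))
        else:
            # A still-undecided candidate can no longer become an answer
            # once it closes, so it is discarded here.
            stack.pop()


# <root><item><a/><ok/></item><item><a/></item><item><ok/><ok/></item></root>
EXAMPLE: List[Event] = [
    ("open", "root"),
    ("open", "item"), ("open", "a"), ("close", "a"),
    ("open", "ok"), ("close", "ok"), ("close", "item"),
    ("open", "item"), ("open", "a"), ("close", "a"), ("close", "item"),
    ("open", "item"), ("open", "ok"), ("close", "ok"),
    ("open", "ok"), ("close", "ok"), ("close", "item"),
    ("close", "root"),
]

answers = list(earliest_items_with_ok(EXAMPLE))
```

In this sketch the first `item` is reported when its `ok` child opens, before the `item` itself closes. The second `item` is held as a candidate until its closing tag and then dropped. For this particular query each event produces at most one answer, so returning nodes one by one is cheap. The abstract's iterator condition concerns general MSO queries, which this sketch does not cover.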