🤖 AI Summary
Long-document question answering faces a fundamental trade-off between global context modeling and fine-grained retrieval: end-to-end processing of the full document incurs prohibitive computational cost, while conventional chunk-based RAG methods often fail to capture long-range dependencies. To address this, the paper proposes GARLIC, a dynamic, controllable graph-structured retrieval framework driven by LLM attention weights. GARLIC constructs a hierarchical weighted directed acyclic graph (DAG) over the document, enabling event-level node abstraction, many-to-many summary edges, and adaptive multi-path retrieval. It further incorporates a dynamic termination strategy that lets the LLM regulate retrieval depth and scope at inference time. Evaluated on two single-document and two multi-document QA benchmarks, GARLIC consistently outperforms Llama 3.1 and prior state-of-the-art RAG baselines, while keeping inference overhead comparable to standard RAG approaches.
📝 Abstract
Traditionally, Retrieval-Augmented Generation (RAG) methods split text into chunks to enable language models to handle long documents. Recent tree-based RAG methods can retrieve detailed information while preserving global context. However, with the advent of more powerful LLMs, such as Llama 3.1, which offer better comprehension and support for longer inputs, we found that even recent tree-based RAG methods perform worse than directly feeding the entire document into Llama 3.1, although RAG methods still hold an advantage in reducing computational costs. In this paper, we propose a new retrieval method, called LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph (GARLIC), which outperforms previous state-of-the-art baselines, including Llama 3.1, while retaining the computational efficiency of RAG methods. Our method introduces several improvements: (1) Rather than using a tree structure, we construct a Hierarchical Weighted Directed Acyclic Graph with many-to-many summarization, where the graph edges are derived from attention mechanisms and each node focuses on a single event or very few events. (2) We introduce a novel retrieval method that leverages the attention weights of LLMs rather than dense embedding similarity. Our method allows for searching the graph along multiple paths and can terminate at any depth. (3) We use the LLM to control the retrieval process, enabling it to dynamically adjust the amount and depth of information retrieved for different queries. Experimental results show that our method outperforms previous state-of-the-art baselines, including Llama 3.1, on two single-document and two multi-document QA datasets, while maintaining computational complexity similar to traditional RAG methods.
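As a rough illustration (not the paper's actual implementation), the multi-path, depth-adaptive retrieval with dynamic termination can be sketched as a best-first traversal of the summary DAG. Everything here is a hypothetical stand-in: `retrieve`, `should_stop`, and the edge weights are placeholders — in GARLIC the weights would come from LLM attention and the stop decision from the LLM itself.

```python
import heapq

def retrieve(graph, weights, roots, should_stop, max_nodes=10):
    """Best-first, multi-path traversal of a weighted summary DAG.

    graph:       node -> children (summary node -> finer-grained nodes)
    weights:     (parent, child) -> relevance weight; stands in for the
                 attention-derived edge weights described in the paper
    roots:       top-level summary nodes that seed the search
    should_stop: callable(visited) -> bool, a stand-in for the LLM's
                 dynamic termination decision
    """
    frontier = [(-1.0, r) for r in roots]   # negate weights -> max-heap
    heapq.heapify(frontier)
    visited, seen = [], set(roots)
    while frontier and len(visited) < max_nodes:
        _, node = heapq.heappop(frontier)
        visited.append(node)
        if should_stop(visited):            # stop as soon as context suffices
            break
        for child in graph.get(node, []):   # expand along every outgoing edge
            if child not in seen:           # DAG: a child may have many parents
                seen.add(child)
                heapq.heappush(frontier, (-weights[(node, child)], child))
    return visited
```

Because the frontier mixes nodes from all depths, the search can stop at coarse summary nodes for broad questions or descend deep along several paths for detail-oriented ones — the behavior the abstract attributes to attention-guided retrieval with LLM-controlled termination.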