Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

📅 2026-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently constructing high-quality multi-hop reasoning training data from unstructured, expert-level documents that lack annotated structure. The authors propose a graph-constrained path selection mechanism: an offline graph of contextual keyword centroids is built first, and five geometric admissibility constraints then determine which reasoning paths are enumerated. Gram-matrix arguments motivate the constraints, showing that local similarity bounds alone permit endpoint drift of up to roughly 91° and that an upper similarity bound is needed to leave the dense embedding cliques formed by boilerplate text. A teacher model then only verbalizes the pre-validated paths as question-answer pairs, decoupling path discovery from language generation. The method targets templated, densely cross-referencing documents, where single-teacher approaches degrade. On the CUAD legal contract corpus, fine-tuning Qwen3-32B on 80K synthesized examples raises closed-book Token F1 from 21.66% to 38.58%. A matched-size ablation attributes the gain to a 4.4× expansion of the usable corpus rather than to higher per-chain quality.
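The path-selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: it models only two of the five constraints (a lower and an upper cosine-similarity bound on each hop), since the remaining three are not described here. The function name `enumerate_paths`, the thresholds `lo` and `hi`, and the toy 2-D centroids are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def enumerate_paths(centroids, lo, hi, hops):
    """Enumerate simple paths with `hops` edges over a centroid graph.

    An edge is admissible (hypothetical rule) when lo <= cosine <= hi:
    the lower bound keeps consecutive nodes related, the upper bound
    rejects near-duplicates such as repeated boilerplate.
    """
    names = list(centroids)

    def admissible(a, b):
        return lo <= cosine(centroids[a], centroids[b]) <= hi

    paths = []

    def extend(path):
        if len(path) == hops + 1:
            paths.append(tuple(path))
            return
        for n in names:
            if n not in path and admissible(path[-1], n):
                extend(path + [n])

    for start in names:
        extend([start])
    return paths


def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))


# Toy centroids: "a_dup" is a near-duplicate of "a" (2 degrees apart).
demo = {"a": unit(0), "a_dup": unit(2), "b": unit(40), "c": unit(80)}
paths = enumerate_paths(demo, lo=0.7, hi=0.95, hops=2)
```

In this toy graph the hop between `a` and `a_dup` is rejected by the upper bound, so no enumerated path steps directly between the two near-duplicates, while `("a", "b", "c")` is admitted.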
📝 Abstract
Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, yet such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ${\sim}91^{\circ}$, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4$\times$ expansion of the usable corpus rather than from higher per-chain quality. This reframes the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. Our code is available at https://github.com/hkgai-official/GCSCS.
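The endpoint-drift claim can be made concrete with a small numerical sketch. It assumes unit-norm embeddings and a single lower bound `TAU` on the cosine similarity of consecutive chain nodes. Because angular distance on the unit sphere obeys the triangle inequality, the endpoint angle is at most `hops * arccos(TAU)` (capped at 180°), and a coplanar chain attains that bound. The values `TAU = 0.7` and `HOPS = 2` are illustrative assumptions, not parameters taken from the paper; they give a drift of about 91°, but whether the paper's figure arises from these settings is not stated in the abstract.

```python
import math


def planar_chain(tau, hops):
    """Worst-case chain: unit vectors in a plane, each hop rotated by arccos(tau)."""
    theta = math.acos(tau)
    return [(math.cos(i * theta), math.sin(i * theta)) for i in range(hops + 1)]


def gram(vectors):
    """Gram matrix of pairwise inner products (cosines, for unit vectors)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def max_endpoint_angle_deg(tau, hops):
    """Upper bound on the endpoint angle when every hop has cosine >= tau."""
    return min(180.0, math.degrees(hops * math.acos(tau)))


TAU, HOPS = 0.7, 2  # hypothetical values, for illustration only
G = gram(planar_chain(TAU, HOPS))

print("consecutive cosines:", round(G[0][1], 4), round(G[1][2], 4))
print("endpoint cosine:", round(G[0][2], 4))
print("max endpoint angle (deg):", round(max_endpoint_angle_deg(TAU, HOPS), 2))
```

Every consecutive pair in this chain satisfies the local bound, yet the endpoints are nearly orthogonal (cosine 2·0.7² − 1 = −0.02). A constraint that looks only at adjacent nodes therefore cannot keep a chain on topic.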
Problem

Research questions and friction points this paper is trying to address.

multi-hop training data
compositional reasoning
specialized documents
graph-constrained path selection
teacher synthesizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-constrained path selection
multi-hop reasoning
compositional reasoning
teacher synthesizability
embedding cliques