🤖 AI Summary
Current LLM training is hindered by the scarcity of high-quality multi-hop reasoning question-answer (QA) pairs, especially in sparse domains such as PubMed papers and legal texts. Existing methods fail to generate controllable, complex, cross-document reasoning questions, which limits models' deep semantic understanding.
Method: We propose Semantic Bridge, the first general-purpose controllable framework for synthesizing multi-hop QA pairs, built on Abstract Meaning Representation (AMR)-driven semantic graph weaving. It constructs cross-document reasoning paths via three strategies: entity bridging, predicate chain expansion, and causal inference. An integrated multi-modal AMR analysis and graph-structured synthesis pipeline enables fine-grained control over question type and complexity, multilingual support, and cross-domain semantic relation extraction.
Results: Experiments show our method outperforms baselines by 18.3%–25.4% across four languages; QA pairs generated from only 200 source documents surpass the performance of those derived from 600 human-annotated samples. Human evaluation confirms a 23.4% increase in question complexity and an 18.7% improvement in answerability.
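The entity-bridging strategy above can be illustrated with a minimal sketch: represent each document as AMR-style (subject, predicate, object) triples, then find entities that appear in more than one document, possibly in different roles, to seed a cross-document reasoning path. The documents, triples, and helper names below are illustrative assumptions, not the paper's actual pipeline:

```python
from collections import defaultdict

# Toy AMR-style triples (subject, predicate, object) per document.
# All entities and relations here are invented for illustration.
doc_graphs = {
    "doc1": [("aspirin", "inhibit", "COX-1"),
             ("COX-1", "produce", "thromboxane")],
    "doc2": [("thromboxane", "promote", "platelet aggregation"),
             ("clot", "cause", "stroke")],
}

def entity_bridges(graphs):
    """Return entities shared across documents, with the role they play in each."""
    roles = defaultdict(set)  # entity -> {(doc_id, role)}
    for doc, triples in graphs.items():
        for subj, _, obj in triples:
            roles[subj].add((doc, "subject"))
            roles[obj].add((doc, "object"))
    # Keep only entities that occur in more than one document.
    return {e: r for e, r in roles.items()
            if len({doc for doc, _ in r}) > 1}

bridges = entity_bridges(doc_graphs)
# "thromboxane" appears as an object in doc1 and a subject in doc2,
# so the path aspirin -> COX-1 -> thromboxane -> platelet aggregation
# can seed a multi-hop question spanning both documents.
```

A generator would then walk such bridged paths to produce questions whose answer requires composing facts from both documents (e.g., linking a drug in one document to a downstream effect described only in the other).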
📝 Abstract
Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and fundamentally fail to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our core innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for shared entities in varying roles, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex reasoning pathways across documents, with fine-grained control over question complexity and type via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine), yielding consistent 18.3%–25.4% gains over baselines across four languages (English, Chinese, French, German). QA pairs generated from 200 source documents outperform those derived from 600 human-annotated examples, using 67% fewer source materials. Human evaluation shows 23.4% higher question complexity, 18.7% better answerability, and 31.2% improved reasoning-pattern coverage. Semantic Bridge establishes a new paradigm for LLM training-data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and Semantic Bridge model.