Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM training is hindered by the scarcity of high-quality, multi-hop reasoning question-answer (QA) pairs—especially in sparse domains such as PubMed and legal texts—while existing methods fail to generate controllable, complex, cross-document reasoning questions, limiting models’ deep semantic understanding. Method: We propose the first general-purpose controllable framework for synthesizing multi-hop QA pairs, leveraging Abstract Meaning Representation (AMR)-driven semantic graph weaving. It constructs cross-document reasoning paths via three strategies: entity bridging, predicate chain expansion, and causal inference. Integrated multimodal AMR analysis and graph-structured synthesis enable fine-grained control over question type, complexity, multilingual support, and cross-domain semantic relation extraction. Results: Experiments show our method outperforms baselines by 18.3%–25.4% across four languages; QA pairs generated from only 200 source documents surpass the performance of those derived from 600 human-annotated samples. Human evaluation confirms a 23.4% increase in question complexity and an 18.7% improvement in answerability.
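The entity-bridging strategy can be illustrated with a toy sketch: find an entity shared across two documents' semantic triples, then chain triples through it into a cross-document reasoning path. The triples, entity names, and helper functions below are illustrative assumptions in simplified (subject, predicate, object) form, not the paper's actual AMR pipeline.

```python
# Hedged sketch of entity bridging over toy AMR-style triples.
# All triples and function names here are hypothetical examples.
from collections import defaultdict

# Each document is a list of (subject, predicate, object) triples,
# loosely mimicking AMR concept/role structure.
doc_a = [
    ("aspirin", "inhibit-01", "cox-enzyme"),
    ("cox-enzyme", "produce-01", "prostaglandin"),
]
doc_b = [
    ("prostaglandin", "cause-01", "inflammation"),
    ("inflammation", "cause-01", "pain"),
]

def entity_bridges(docs):
    """Return entities that appear (in any role) in more than one document."""
    seen = defaultdict(set)          # entity -> ids of docs mentioning it
    for i, triples in enumerate(docs):
        for s, _, o in triples:
            seen[s].add(i)
            seen[o].add(i)
    return {e for e, ids in seen.items() if len(ids) > 1}

def weave_path(docs, bridge):
    """Greedily chain triples touching the bridge entity, then extend the
    chain with triples whose subject continues the previous object."""
    path = []
    for triples in docs:
        for t in triples:
            if bridge in (t[0], t[2]) or (path and path[-1][2] == t[0]):
                path.append(t)
    return path

bridges = entity_bridges([doc_a, doc_b])
path = weave_path([doc_a, doc_b], "prostaglandin")
```

Here "prostaglandin" is the bridging entity, and the woven path (cox-enzyme produces prostaglandin, prostaglandin causes inflammation, inflammation causes pain) supports a multi-hop question such as "How can aspirin reduce pain?". The paper's predicate-chain and causal bridging strategies would extend this with temporal/causal edge types rather than shared entities alone.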

📝 Abstract
Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns and fundamentally fail to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and question types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine), yielding consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). QA pairs generated from 200 source documents outperform 600 human-annotated examples, using 67% fewer source materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of high-quality reasoning QA pairs for LLM training
Existing methods fail to generate controllable multi-hop questions
Need for universal framework to create complex reasoning questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic graph weaving for multi-hop reasoning
AMR-driven analysis for fine-grained control
Multi-modal pipeline for high-quality QA generation
Authors: Linqing Chen (Patsnap), Hanmeng Zhong, Wentao Wu, Weilei Wang