🤖 AI Summary
Current LLM training is hindered by the scarcity of high-quality multi-hop reasoning question-answer (QA) pairs, especially in sparse domains such as PubMed papers and legal texts. Existing methods fail to generate controllable, complex, cross-document reasoning questions, which limits models' deep semantic understanding.
Method: We propose Semantic Bridge, the first general-purpose controllable framework for synthesizing multi-hop QA pairs, built on Abstract Meaning Representation (AMR)-driven semantic graph weaving. It constructs cross-document reasoning paths via three strategies: entity bridging, predicate chain expansion, and causal inference. An integrated multi-modal AMR analysis and graph-structured synthesis pipeline enables fine-grained control over question type and complexity, multilingual support, and cross-domain semantic relation extraction.
Results: Experiments show our method outperforms baselines by 18.3%–25.4% across four languages; QA pairs generated from only 200 source documents surpass the performance of those derived from 600 human-annotated samples. Human evaluation confirms a 23.4% increase in question complexity and an 18.7% improvement in answerability.
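The entity-bridging strategy above can be illustrated with a minimal sketch: represent each document as AMR-style (subject, predicate, object) triples, then find entities that appear in more than one document, possibly in different roles, to seed a cross-document reasoning path. The documents, triples, and helper names below are illustrative assumptions, not the paper's actual pipeline:

```python
from collections import defaultdict

# Toy AMR-style triples (subject, predicate, object) per document.
# All entities and relations here are invented for illustration.
doc_graphs = {
    "doc1": [("aspirin", "inhibit", "COX-1"),
             ("COX-1", "produce", "thromboxane")],
    "doc2": [("thromboxane", "promote", "platelet aggregation"),
             ("clot", "cause", "stroke")],
}

def entity_bridges(graphs):
    """Return entities shared across documents, with the role they play in each."""
    roles = defaultdict(set)  # entity -> {(doc_id, role)}
    for doc, triples in graphs.items():
        for subj, _, obj in triples:
            roles[subj].add((doc, "subject"))
            roles[obj].add((doc, "object"))
    # Keep only entities that occur in more than one document.
    return {e: r for e, r in roles.items()
            if len({doc for doc, _ in r}) > 1}

bridges = entity_bridges(doc_graphs)
# "thromboxane" appears as an object in doc1 and a subject in doc2,
# so the path aspirin -> COX-1 -> thromboxane -> platelet aggregation
# can seed a multi-hop question spanning both documents.
```

A generator would then walk such bridged paths to produce questions whose answer requires composing facts from both documents (e.g., linking a drug in one document to a downstream effect described only in the other).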
📝 Abstract
Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and fundamentally fail to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our core innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for shared entities in varying roles, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex reasoning pathways across documents, with fine-grained control over question complexity and type via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine), yielding consistent 18.3%–25.4% gains over baselines across four languages (English, Chinese, French, German). QA pairs generated from 200 source documents outperform those derived from 600 human-annotated examples, using 67% fewer source materials. Human evaluation shows 23.4% higher question complexity, 18.7% better answerability, and 31.2% improved reasoning-pattern coverage. Semantic Bridge establishes a new paradigm for LLM training-data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and Semantic Bridge model.