GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity, high cost, and poor long-tail coverage of high-quality annotated data for supervised fine-tuning, this paper proposes a knowledge graph (KG)-driven synthetic data generation framework tailored to atomic, aggregated, and multi-hop question answering. Methodologically, it introduces fine-grained KG modeling and an Expected Calibration Error (ECE)-based mechanism for identifying knowledge gaps, integrated with multi-hop neighborhood sampling and style-controllable generation by a large language model (LLM). This design improves factual accuracy and long-tail knowledge coverage. Empirically, the approach outperforms existing synthetic-data baselines on closed-book, knowledge-intensive tasks, enhancing both generalization and factual consistency. The framework's code and datasets are publicly released.

📝 Abstract
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
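The knowledge-gap step hinges on the expected calibration error: the model's stated confidence is binned and compared against its actual accuracy, and knowledge where confidence and accuracy diverge is prioritized for QA generation. The sketch below is an illustrative implementation of the standard ECE metric under that reading, not the paper's released code; `confidences` and `correct` are hypothetical per-question model confidences and correctness flags.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |confidence - accuracy| gap per bin, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()    # empirical accuracy in this bin
        conf = confidences[mask].mean()  # mean stated confidence
        ece += mask.mean() * abs(conf - acc)
    return ece
```

A perfectly calibrated model yields an ECE of zero; a model that answers with 0.8 confidence but only 50% accuracy contributes a 0.3 gap, flagging that region of the graph as a knowledge gap worth targeting.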
Problem

Research questions and friction points this paper is trying to address.

Addressing costly high-quality data needs for LLM fine-tuning
Overcoming synthetic data inaccuracies and knowledge gaps
Enhancing QA data diversity and relational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph-guided synthetic data generation
Multi-hop sampling for complex relations
Style-controlled generation for diverse QA data
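The multi-hop sampling idea above can be sketched as a k-hop subgraph extraction: starting from a seed entity, collect all neighbors reachable within k hops and the edges among them, which a generator would then condition on when writing a multi-hop question. This is a minimal BFS illustration assuming a plain adjacency-list KG, not the framework's actual sampler.

```python
from collections import deque

def k_hop_subgraph(adj, seed, k):
    """Return the nodes within k hops of `seed` and the induced
    edge list, via breadth-first search over an adjacency dict."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:  # do not expand beyond the hop limit
            continue
        for nbr in adj.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    nodes = set(depth)
    # keep only edges whose endpoints both fall inside the neighborhood
    edges = [(u, v) for u in nodes for v in adj.get(u, ()) if v in nodes]
    return nodes, edges
```

For a chain A→B→C→D, a 2-hop sample from A keeps {A, B, C} and the edges A→B and B→C, giving the generator exactly the relational context needed for a two-step question.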
Zihong Chen
Shanghai Artificial Intelligence Laboratory
Wanli Jiang
Shanghai Artificial Intelligence Laboratory
Jinzhe Li
Fudan University & Shanghai AI Lab
Zhonghang Yuan
Shanghai Artificial Intelligence Laboratory
Huanjun Kong
Shanghai AI Laboratory
Wanli Ouyang
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Nanqing Dong
Shanghai Artificial Intelligence Laboratory; University of Oxford