GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity, high cost, and poor long-tail coverage of high-quality annotated data for supervised fine-tuning, this paper proposes a knowledge graph (KG)-driven synthetic data generation framework tailored to atomic, aggregated, and multi-hop question answering. Methodologically, it introduces fine-grained KG modeling and an Expected Calibration Error (ECE)-based mechanism for identifying knowledge gaps, integrated with multi-hop neighborhood sampling and style-controllable generation by a large language model (LLM). This design improves factual accuracy and long-tail knowledge coverage. Empirically, the approach outperforms existing synthetic-data baselines on closed-book, knowledge-intensive tasks, enhancing both generalization and factual consistency. The framework's code and datasets are publicly released.

📝 Abstract
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
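The knowledge-gap step hinges on the expected calibration error: the model's stated confidence is binned and compared against its actual accuracy, and knowledge where confidence and accuracy diverge is prioritized for QA generation. The sketch below is an illustrative implementation of the standard ECE metric under that reading, not the paper's released code; `confidences` and `correct` are hypothetical per-question model confidences and correctness flags.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |confidence - accuracy| gap per bin, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()    # empirical accuracy in this bin
        conf = confidences[mask].mean()  # mean stated confidence
        ece += mask.mean() * abs(conf - acc)
    return ece
```

A perfectly calibrated model yields an ECE of zero; a model that answers with 0.8 confidence but only 50% accuracy contributes a 0.3 gap, flagging that region of the graph as a knowledge gap worth targeting.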
Problem

Research questions and friction points this paper is trying to address.

Addressing costly high-quality data needs for LLM fine-tuning
Overcoming synthetic data inaccuracies and knowledge gaps
Enhancing QA data diversity and relational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge graph-guided synthetic data generation
Multi-hop sampling for complex relations
Style-controlled generation for diverse QA data
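The multi-hop sampling idea above can be sketched as a k-hop subgraph extraction: starting from a seed entity, collect all neighbors reachable within k hops and the edges among them, which a generator would then condition on when writing a multi-hop question. This is a minimal BFS illustration assuming a plain adjacency-list KG, not the framework's actual sampler.

```python
from collections import deque

def k_hop_subgraph(adj, seed, k):
    """Return the nodes within k hops of `seed` and the induced
    edge list, via breadth-first search over an adjacency dict."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:  # do not expand beyond the hop limit
            continue
        for nbr in adj.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    nodes = set(depth)
    # keep only edges whose endpoints both fall inside the neighborhood
    edges = [(u, v) for u in nodes for v in adj.get(u, ()) if v in nodes]
    return nodes, edges
```

For a chain A→B→C→D, a 2-hop sample from A keeps {A, B, C} and the edges A→B and B→C, giving the generator exactly the relational context needed for a two-step question.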
Zihong Chen
Shanghai Artificial Intelligence Laboratory
Wanli Jiang
Shanghai Artificial Intelligence Laboratory
Jinzhe Li
Fudan University & Shanghai AI Lab
Zhonghang Yuan
Shanghai Artificial Intelligence Laboratory
Huanjun Kong
Shanghai AI Laboratory
Wanli Ouyang
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong
Nanqing Dong
Shanghai Artificial Intelligence Laboratory; University of Oxford