🤖 AI Summary
Existing synthetic graph generation methods struggle to jointly model high-dimensional attributes, heterogeneous structure, and semantic fidelity in complex graph data. This paper proposes ProvCreator, which it presents as the first end-to-end large language model (LLM)-based framework for generating complex heterogeneous graphs. ProvCreator losslessly serializes a graph into a joint structure-attribute sequence and uses a Transformer architecture to model and generate nodes, edges, and their attributes together. A compression mechanism keeps large graphs within the model's context window. Evaluation on cybersecurity provenance graphs and the IntelliGraph knowledge graph benchmark demonstrates improvements in structural consistency and semantic fidelity, supporting scalable, privacy-aware synthetic graph generation.
📝 Abstract
The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While synthetic data generation has been successful in the text and image domains, it remains challenging for graphs, especially real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, which limits its usefulness and relevance for application domains that require semantic fidelity.
In this research, we introduce ProvCreator, a synthetic graph generation framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that (1) losslessly encodes graph structure and attributes, (2) efficiently compresses large graphs for contextual modeling, and (3) supports end-to-end, learnable graph generation.
To validate our approach, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from the IntelliGraph benchmark dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.
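To make the graph-to-sequence formulation in the abstract concrete, the sketch below shows one way a heterogeneous attributed graph could be flattened into a token sequence and recovered exactly. It is an illustration only, not ProvCreator's actual encoding: the `<node>` and `<edge>` marker tokens, the function names `encode_graph` and `decode_graph`, the JSON attribute format, and the example provenance graph are all assumptions made here, and the compression step is not modeled.

```python
import json

# Hypothetical marker tokens; the paper's real vocabulary is not given in the abstract.
NODE, EDGE = "<node>", "<edge>"


def encode_graph(nodes, edges):
    """Flatten a typed, attributed graph into a list of string tokens.

    nodes: dict mapping node id -> (node_type, attribute dict)
    edges: list of (src_id, edge_type, dst_id, attribute dict)
    """
    tokens = []
    for node_id, (node_type, attrs) in nodes.items():
        # Attributes are stored as one JSON token so no information is dropped.
        tokens += [NODE, node_id, node_type, json.dumps(attrs, sort_keys=True)]
    for src, edge_type, dst, attrs in edges:
        tokens += [EDGE, src, edge_type, dst, json.dumps(attrs, sort_keys=True)]
    return tokens


def decode_graph(tokens):
    """Invert encode_graph: rebuild the node dict and edge list from tokens."""
    nodes, edges = {}, []
    i = 0
    while i < len(tokens):
        if tokens[i] == NODE:
            node_id, node_type, attrs = tokens[i + 1:i + 4]
            nodes[node_id] = (node_type, json.loads(attrs))
            i += 4
        elif tokens[i] == EDGE:
            src, edge_type, dst, attrs = tokens[i + 1:i + 5]
            edges.append((src, edge_type, dst, json.loads(attrs)))
            i += 5
        else:
            raise ValueError(f"unexpected token at position {i}: {tokens[i]!r}")
    return nodes, edges


# Made-up provenance-style example: a process writing to a file.
nodes = {
    "p1": ("process", {"cmdline": "bash -c ls", "pid": 4121}),
    "f1": ("file", {"path": "/tmp/out.txt"}),
}
edges = [("p1", "write", "f1", {"bytes": 128})]

tokens = encode_graph(nodes, edges)
restored_nodes, restored_edges = decode_graph(tokens)
```

In this sketch, "lossless" means that decoding the sequence returns the same nodes, edges, types, and attributes that were encoded. A sequence model trained on such token lists would then emit new sequences that the decoder turns back into graphs.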