PROVCREATOR: Synthesizing Complex Heterogenous Graphs with Node and Edge Attributes

📅 2025-07-28
🤖 AI Summary
Existing synthetic graph generation methods struggle to jointly model high-dimensional attributes, heterogeneous structure, and semantic fidelity in complex graph data. This paper proposes ProvCreator, an end-to-end framework based on large language models (LLMs) for generating complex heterogeneous graphs, described as the first work to apply LLMs to the joint generation of such graphs. It losslessly serializes each graph into a joint structure-attribute sequence and uses a Transformer to model and generate nodes, edges, and their semantic relationships together. An efficient compression mechanism shortens the sequences for large graphs so they can be modeled in context. Evaluation on cybersecurity provenance graphs and the IntelliGraph knowledge graph benchmark shows that the generated graphs capture dependencies between structure and semantics, supporting realistic, privacy-aware synthetic datasets.

📝 Abstract
The rise of graph-structured data has driven interest in graph learning and synthetic data generation. While synthetic data generation has been successful in the text and image domains, synthetic graph generation remains challenging, especially for real-world graphs with complex, heterogeneous schemas. Existing research has focused mostly on homogeneous structures with simple attributes, limiting its usefulness and relevance for application domains that require semantic fidelity. In this research, we introduce ProvCreator, a synthetic graph framework designed for complex heterogeneous graphs with high-dimensional node and edge attributes. ProvCreator formulates graph synthesis as a sequence generation task, enabling the use of transformer-based large language models. It features a versatile graph-to-sequence encoder-decoder that (1) losslessly encodes graph structure and attributes, (2) efficiently compresses large graphs for contextual modeling, and (3) supports end-to-end, learnable graph generation. To validate our research, we evaluate ProvCreator on two challenging domains: system provenance graphs in cybersecurity and knowledge graphs from the IntelliGraph Benchmark Dataset. In both cases, ProvCreator captures intricate dependencies between structure and semantics, enabling the generation of realistic and privacy-aware synthetic datasets.

Problem

Research questions and friction points this paper is trying to address.

Synthetic generation of complex heterogeneous graphs with attributes
Addressing the limited semantic fidelity of existing methods, which focus mostly on homogeneous graphs with simple attributes
Capturing dependencies between graph structure and semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based sequence generation for graphs
Lossless graph-to-sequence encoder-decoder framework
End-to-end learnable heterogeneous graph synthesis
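
The idea of a lossless graph-to-sequence encoder-decoder can be illustrated with a toy round trip. The sketch below is not ProvCreator's actual encoding, which this page does not describe; the record layout (`"N"` and `"E"` tags, one JSON array per line) and the function names `encode_graph` and `decode_graph` are hypothetical choices made for illustration. It assumes node ids are strings and all attributes are JSON-compatible values.

```python
import json


def encode_graph(nodes, edges):
    """Serialize a heterogeneous attributed graph into one text sequence.

    nodes: dict mapping node id (str) -> (node_type, attrs dict)
    edges: list of (src_id, edge_type, dst_id, attrs dict)
    Hypothetical format: one JSON array per line, nodes first, then edges.
    """
    records = []
    for node_id in sorted(nodes):
        node_type, attrs = nodes[node_id]
        records.append(["N", node_id, node_type, attrs])
    for src, edge_type, dst, attrs in edges:
        records.append(["E", src, edge_type, dst, attrs])
    # json.dumps escapes newlines and non-ASCII characters inside strings,
    # so splitting the sequence on line boundaries is safe when decoding.
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def decode_graph(sequence):
    """Rebuild (nodes, edges) from a sequence produced by encode_graph."""
    nodes, edges = {}, []
    for line in sequence.splitlines():
        record = json.loads(line)
        if record[0] == "N":
            _, node_id, node_type, attrs = record
            nodes[node_id] = (node_type, attrs)
        elif record[0] == "E":
            _, src, edge_type, dst, attrs = record
            edges.append((src, edge_type, dst, attrs))
        else:
            raise ValueError("unknown record tag: %r" % (record[0],))
    return nodes, edges


# A tiny provenance-style graph: one process writes to one file.
nodes = {
    "p1": ("process", {"path": "/usr/bin/python3", "cmdline": "python3 job.py"}),
    "f1": ("file", {"path": "/tmp/out.txt"}),
}
edges = [("p1", "write", "f1", {"bytes": 128})]

sequence = encode_graph(nodes, edges)
restored_nodes, restored_edges = decode_graph(sequence)
```

Decoding the encoded sequence reproduces the original nodes and edges exactly, which is the property "lossless" refers to. In a generative setting, a sequence model would be trained on such sequences and its output passed through the decoder; that step, and any compression of long sequences, is outside this sketch.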