TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: High-quality, diverse training data is scarce, and existing LLM-based synthesis methods are constrained by limited seed data, model bias, and insufficient control over generation. Method: This paper proposes TreeSynth, a tree-guided subspace partitioning framework for data synthesis. Starting from a task description, it recursively decomposes the data space into mutually exclusive and collectively exhaustive atomic subspaces by alternating two steps at each node: determining a partitioning criterion, then covering the resulting subspaces. Data is then synthesized per leaf node, which removes the need for seed data and gives explicit control over what is generated. Contribution/Results: The tree makes the generation process interpretable while targeting full coverage of the data space and diversity that holds up as the dataset grows. Experiments report improvements of up to 45.2% in data diversity and up to 17.6% in downstream task performance, surpassing both human-designed datasets and state-of-the-art synthetic data baselines.

📝 Abstract
Model customization requires high-quality and diverse datasets, but acquiring such data remains challenging and costly. Although large language models (LLMs) can synthesize training data, current approaches are constrained by limited seed data, model bias, and insufficient control over the generation process, resulting in limited diversity and biased distributions as data scales up. To tackle this challenge, we present TreeSynth, a tree-guided subspace-based data synthesis framework that recursively partitions the entire data space into hierarchical subspaces, enabling comprehensive and diverse scaling of data synthesis. Briefly, given a task-specific description, we construct a data space partitioning tree by iteratively executing criteria determination and subspace coverage steps. This hierarchically divides the whole space (i.e., root node) into mutually exclusive and complementary atomic subspaces (i.e., leaf nodes). By collecting synthesized data according to the attributes of each leaf node, we obtain a diverse dataset that fully covers the data space. Empirically, our extensive experiments demonstrate that TreeSynth surpasses both human-designed datasets and state-of-the-art data synthesis baselines, achieving maximum improvements of 45.2% in data diversity and 17.6% in downstream task performance across various models and tasks. We hope TreeSynth provides a scalable solution for synthesizing diverse and comprehensive datasets from scratch without human intervention.
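The recursive partitioning described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Node`, `build_tree`, `leaves`, `synthesize`, and the `max_depth` stopping rule are hypothetical names and choices introduced here, and the two callables standing in for LLM calls (`toy_criterion` for criteria determination and subspace coverage, `toy_generate` for per-leaf synthesis) are deterministic stubs. The sketch also assumes each criterion's values cover the node's subspace without overlap; nothing in the code checks that.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: a criterion is a dimension name plus the values covering it.
Criterion = Tuple[str, List[str]]


@dataclass
class Node:
    """One subspace, described by the attribute values chosen on the path from the root."""
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def build_tree(
    description: str,
    propose_criterion: Callable[[str, Dict[str, str]], Optional[Criterion]],
    max_depth: int,
    node: Optional[Node] = None,
    depth: int = 0,
) -> Node:
    """Recursively split a subspace; max_depth is an assumed stopping rule."""
    node = node if node is not None else Node()
    if depth >= max_depth:
        return node
    # Criteria determination: ask for the next dimension (an LLM call in practice).
    criterion = propose_criterion(description, node.attributes)
    if criterion is None:
        return node  # no further split proposed, so this node stays a leaf
    name, values = criterion
    # Subspace coverage: one child per value, assumed exclusive and exhaustive.
    for value in values:
        child = Node(attributes={**node.attributes, name: value})
        node.children.append(
            build_tree(description, propose_criterion, max_depth, child, depth + 1)
        )
    return node


def leaves(node: Node) -> List[Node]:
    """Collect the atomic subspaces (leaf nodes) in left-to-right order."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def synthesize(description, root, generate, per_leaf):
    """Generate per_leaf samples for every leaf, conditioned on its attributes."""
    return [
        generate(description, leaf.attributes, i)
        for leaf in leaves(root)
        for i in range(per_leaf)
    ]


def toy_criterion(description, attributes):
    """Stub for the LLM: a fixed plan indexed by how many attributes are already set."""
    plan = [("topic", ["algebra", "geometry"]), ("difficulty", ["easy", "hard"])]
    return plan[len(attributes)] if len(attributes) < len(plan) else None


def toy_generate(description, attributes, index):
    """Stub for the LLM: returns a prompt-like record instead of real text."""
    tags = ", ".join(f"{k}={v}" for k, v in attributes.items())
    return {"attributes": dict(attributes), "text": f"{description} [{tags}] #{index}"}


if __name__ == "__main__":
    root = build_tree("math word problems", toy_criterion, max_depth=3)
    for sample in synthesize("math word problems", root, toy_generate, per_leaf=1):
        print(sample["text"])
```

In this sketch, control over the dataset comes from the leaf attributes: every sample is tied to exactly one leaf, so the number of samples per subspace is set directly by `per_leaf` rather than left to the generator's own preferences.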
Problem

Research questions and friction points this paper is trying to address.

Generating diverse synthetic data without human intervention
Overcoming limited diversity and bias in LLM-based data synthesis
Partitioning data space hierarchically for comprehensive coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-guided subspace partitioning for data synthesis
Hierarchical division of data space into subspaces
Automated diverse dataset generation without human intervention