ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing quality-filtering paradigms often erroneously discard valuable signals embedded in low-quality supervised fine-tuning (SFT) data. Method: We propose a neural-symbolic collaborative framework for data purification and reconstruction. It employs a statistical-prior-driven symbolic rule module to remove noise and a neural reconstruction module, guided by model latent representations and domain knowledge, to generate high-quality instruction-response pairs. Crucially, this approach achieves superior performance using *only* low-quality data, without requiring any high-quality examples. Contribution/Results: On five mainstream instruction-following benchmarks, our method significantly outperforms 13 state-of-the-art data-selection strategies. The enhanced dataset, constructed exclusively from low-quality data, surpasses the baseline model trained on ~300K raw, unfiltered samples, demonstrating that structurally reconstructed low-quality data attains higher information density and training efficacy.

📝 Abstract
Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.
Problem

Research questions and friction points this paper is trying to address.

Enhancing low-quality SFT data through neural-symbolic purification and reconstruction
Revitalizing discarded noisy samples to improve instruction-following capabilities
Overcoming limitations of imperfect quality filters in supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neural-symbolic framework purifies and reconstructs low-quality data
Symbolic module prunes noisy samples using statistical priors
Neural component synthesizes enriched instruction-response pairs
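The two-stage pipeline described above can be sketched in miniature. The sketch below is illustrative only: the paper does not publish its concrete statistical priors or reconstruction prompts, so the thresholds (`min_len`, `max_rep`) and the `rewrite_fn` stand-in for the LLM-backed neural rewriter are hypothetical placeholders.

```python
def symbolic_purge(samples, min_len=3, max_len=2048, max_rep=0.5):
    """Symbolic stage: prune samples violating simple statistical priors.
    Illustrative rules only (length bounds, token-repetition ratio);
    the paper's actual rule set is not specified here."""
    kept = []
    for s in samples:
        tokens = s["response"].split()
        if not (min_len <= len(tokens) <= max_len):
            continue  # too short or too long
        rep = 1 - len(set(tokens)) / len(tokens)  # fraction of repeated tokens
        if rep > max_rep:
            continue  # degenerate, highly repetitive response
        kept.append(s)
    return kept

def neural_reconstruct(samples, rewrite_fn):
    """Neural stage: synthesize enriched instruction-response pairs.
    `rewrite_fn` is a hypothetical stand-in for the LLM-guided rewriter."""
    return [
        {"instruction": s["instruction"],
         "response": rewrite_fn(s["instruction"], s["response"])}
        for s in samples
    ]

# Toy low-quality corpus and a trivial stand-in rewriter
data = [
    {"instruction": "Explain SFT.",
     "response": "SFT adapts a pretrained model to instructions."},
    {"instruction": "Say hi.",
     "response": "hi hi hi hi hi hi hi hi"},  # repetitive -> pruned
    {"instruction": "Reply briefly.",
     "response": "ok"},                        # too short -> pruned
]
purged = symbolic_purge(data)
enhanced = neural_reconstruct(purged, lambda instr, resp: resp.strip())
print(len(purged), len(enhanced))  # -> 1 1
```

In the real framework the rewriter would draw on model latent representations and domain knowledge rather than a string transform, but the control flow (purge, then reconstruct only what survives) is the same.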
👥 Authors
Zile Yang
The Hong Kong University of Science and Technology (Guangzhou)
Ling Li
The Hong Kong University of Science and Technology (Guangzhou)
Na Di
The Hong Kong University of Science and Technology (Guangzhou)
Jinlong Pang
University of California, Santa Cruz (Trustworthy AI, LLMs)
Yao Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Hao Cheng
Hong Kong Baptist University
Bo Han
Hong Kong Baptist University
Jiaheng Wei
The Hong Kong University of Science and Technology (Guangzhou)