🤖 AI Summary
To address the inefficiency of training data selection in task-specific instruction fine-tuning, this paper proposes, for the first time, a cognitive-mechanism-driven data selection paradigm inspired by neural co-activation in the human brain. Specifically, it leverages intermediate-layer neuron activation states of large language models (LLMs) as sample embeddings. High-dimensional activation vectors are extracted and subsequently employed in similarity-based scoring and submodular optimization to enable efficient, principled data filtering. Extensive experiments across multiple LLMs (LLaMA-2, Qwen), diverse NLP tasks (NLI, RE, QA), and varying selection ratios (10%–50%) demonstrate consistent superiority over baseline methods: instruction fine-tuning accuracy improves by 2.1–4.7 percentage points on average, data utilization efficiency increases by over 3×, and the approach exhibits enhanced generalization and robustness.
📝 Abstract
Task-specific instruction tuning enhances the performance of large language models (LLMs) on specialized tasks, yet efficiently selecting relevant data for this purpose remains a challenge. Inspired by neural coactivation in the human brain, we propose a novel data selection method called NAS, which leverages neuronal activation states as embeddings for samples in the feature space. Extensive experiments show that NAS outperforms classical data selection methods in terms of both effectiveness and robustness across different models, datasets, and selection ratios.