🤖 AI Summary
Fine-tuning large language models (LLMs) for multi-domain reasoning remains hindered by reliance on large-scale, costly-to-annotate datasets and prohibitive computational overhead.
Method: We propose NanoFlux, a lightweight adversarial framework built on an attacker-defender dual-model architecture and supervised by a tool-augmented Judge model, which automatically generates high-quality, multi-step reasoning questions with explanatory annotations. The framework further incorporates embedding-based novelty filtering and multi-hop reasoning evaluation, enabling automated synthesis and intelligent curation of compact, high-fidelity training data ("small but precise").
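The attacker-defender-judge dynamic can be sketched as a simple generation loop. This is a hypothetical illustration under assumed interfaces (`attacker`, `defender`, `judge` stand in for the paper's LLM roles; the real NanoFlux pipeline is not specified here):

```python
# Hypothetical sketch of the NanoFlux adversarial loop (function names and the
# verdict schema are assumptions, not the authors' actual API). The Attacker
# proposes a question, the Defender attempts it, and a tool-augmented Judge
# decides whether the pair is kept as an annotated training example.
import itertools

def nanoflux_round(attacker, defender, judge, dataset, max_rounds=10):
    """Run one adversarial data-generation loop, keeping examples the
    Judge rates as high-quality multi-step reasoning problems."""
    for _ in range(max_rounds):
        question = attacker()                # Attacker synthesizes a candidate
        answer = defender(question)          # Defender attempts a solution
        verdict = judge(question, answer)    # Judge scores it (with tool use)
        if verdict["keep"]:
            dataset.append({"question": question,
                            "answer": answer,
                            "annotation": verdict["explanation"]})
    return dataset

# Toy stand-ins so the loop runs end-to-end (real roles would be LLM calls):
counter = itertools.count(1)
attacker = lambda: f"Compute the sum 1 + 2 + ... + {next(counter)}."
defender = lambda q: "Use the formula n(n+1)/2."
judge = lambda q, a: {"keep": True, "explanation": "multi-step arithmetic"}

data = nanoflux_round(attacker, defender, judge, [], max_rounds=3)
print(len(data))  # 3 kept examples
```

In the actual framework the two models alternate the Attacker and Defender roles between rounds, which this minimal sketch omits.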
Contribution/Results: Fine-tuning a 4B-parameter model on fewer than 200 generated samples yields +5.9%, +3.6%, and +16.6% improvements in mathematical, scientific, and medical reasoning accuracy, respectively, while reducing computational cost by 3-14x. Crucially, we empirically uncover a non-monotonic relationship between dataset characteristics and model performance, demonstrating the strong optimization potential of small, precisely targeted training datasets.
📝 Abstract
We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.
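The embedding-based novelty filtering mentioned above can be illustrated with a greedy cosine-similarity filter. This is a minimal sketch under assumptions (the function names, the greedy strategy, and the toy bag-of-characters encoder are illustrative; the paper's actual embedding model and threshold are not given here):

```python
# Hypothetical sketch of embedding-based novelty filtering: accept a candidate
# question only if its embedding is sufficiently dissimilar (cosine similarity
# below a threshold) from every previously accepted question.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def novelty_filter(candidates, embed, threshold=0.85):
    """Greedily retain candidates whose embeddings differ enough
    from all previously accepted ones."""
    accepted, accepted_vecs = [], []
    for question in candidates:
        vec = embed(question)
        if all(cosine(vec, v) < threshold for v in accepted_vecs):
            accepted.append(question)
            accepted_vecs.append(vec)
    return accepted

# Toy embedding: letter-count vector (stand-in for a real sentence encoder).
def toy_embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

qs = ["What is 2+2?", "What is 2 + 2?", "Name the powerhouse of the cell."]
print(novelty_filter(qs, toy_embed, threshold=0.95))
# → ['What is 2+2?', 'Name the powerhouse of the cell.']
```

The near-duplicate phrasing of the second question is filtered out, while the semantically distinct third question passes; in the full pipeline this keeps the sub-200-example dataset compact without redundancy.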