🤖 AI Summary
To address severe hallucination and insufficient reasoning robustness in small-scale generative models (e.g., Qwen 1.5 0.5B), this paper proposes a Unified Virtual Expert Fusion (UVEF) framework that enhances factual consistency and reasoning accuracy without increasing the parameter count. Methodologically, UVEF introduces: (1) a virtual expert prompting mechanism with a tunable number of experts; (2) a statistical truncation strategy based on the mean and standard deviation to suppress abnormally high-confidence hallucinated outputs; (3) fixed majority voting (in place of a dynamic gating network) to decouple expert contributions and improve interpretability; and (4) theoretical grounding from both statistical inference and ensemble learning perspectives. Experiments demonstrate significant reductions in hallucination rates and consistent improvements in accuracy and robustness on dialogue generation tasks. Ablation studies validate the efficacy of each component, support orthogonality assessment among experts, and point toward future extensions with dynamic weighted aggregation.
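The mean–standard-deviation truncation in component (2) can be sketched as follows. This is a minimal illustration, assuming the strategy caps token probabilities that lie more than `k` standard deviations above the mean and then renormalizes; the function name and the multiplier `k` are illustrative choices, not taken from the paper.

```python
import numpy as np

def truncate_outlier_probs(probs, k=1.0):
    """Cap abnormally high token probabilities at mean + k*std,
    then renormalize so the result is still a distribution.
    (Illustrative sketch; k is a hypothetical hyperparameter.)"""
    probs = np.asarray(probs, dtype=float)
    threshold = probs.mean() + k * probs.std()
    clipped = np.minimum(probs, threshold)  # suppress overconfident spikes
    return clipped / clipped.sum()          # renormalize to sum to 1

# A spiked distribution gets flattened but keeps its ranking:
out = truncate_outlier_probs([0.9, 0.05, 0.03, 0.02], k=1.0)
```

Note that clipping to a common threshold preserves the relative order of the remaining tokens, so the most likely token is still selected, only with reduced dominance.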
📝 Abstract
Generative models such as GPT and BERT have significantly improved performance on tasks like text generation and summarization. However, hallucinations, where models generate non-factual or misleading content, are especially problematic in smaller-scale architectures, limiting their real-world applicability. In this paper, we propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model without increasing the parameter count. Our method leverages multiple domain-specific expert prompts (with an adjustable number of experts) to guide the model from different perspectives. We apply a statistical outlier-truncation strategy based on the mean and standard deviation to filter out abnormally high-probability predictions, and we inject noise into the embedding space to promote output diversity. To clearly assess the contribution of each module, we adopt a fixed voting mechanism rather than a dynamic gating network, thereby avoiding additional confounding factors. We provide detailed theoretical derivations from both statistical and ensemble-learning perspectives to demonstrate how our method reduces output variance and suppresses hallucinations. Extensive ablation experiments on dialogue generation tasks show that our approach significantly improves inference accuracy and robustness in small models. Additionally, we discuss methods for evaluating the orthogonality of virtual experts and outline future work on dynamic expert-weight allocation using gating networks.
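The fixed voting mechanism described above can be sketched as follows. This is a minimal illustration, assuming each "virtual expert" is the same model queried under a different domain-specific prompt prefix and that ties are broken by first-seen order; the prompt texts and function names are hypothetical, not from the paper.

```python
from collections import Counter

def majority_vote(expert_answers):
    """Fixed (unweighted) majority vote over the answers produced under
    different expert prompts. Counter.most_common breaks ties by
    insertion order, so the first-seen answer wins a tie."""
    return Counter(expert_answers).most_common(1)[0][0]

# Each virtual expert is the same base model conditioned on a
# different prompt prefix (illustrative prompts):
expert_prompts = [
    "As a fact-checking expert, answer: ",
    "As a domain specialist, answer: ",
    "As a careful step-by-step reasoner, answer: ",
]
```

Because the weights are fixed rather than learned, each expert's contribution to the final answer is directly attributable, which is what makes the per-module ablations in the paper clean to interpret.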