Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address parameter redundancy in large models and the performance limitations of individual small models for emotion detection, this paper proposes a lightweight heterogeneous ensemble framework comprising five compact language models, including BERT and RoBERTa. Grounded in the Condorcet Jury Theorem, it introduces a dual-weighted dynamic voting mechanism: global reliability is calibrated from each model's validation-set performance, while instance-level confidence is derived from fine-grained probabilistic modeling. This design preserves error diversity across base models while enabling synergistic performance gains. Experiments on the DAIR-AI benchmark yield a macro F1-score of 93.5%, substantially outperforming LoRA-finetuned large models, including Falcon-7B, Mistral-7B, Qwen-7B, and Phi-3, despite using only 595M total parameters. This represents over an 11× improvement in parameter efficiency and constitutes the first empirical demonstration that carefully designed small-model ensembles can surpass mainstream 7B-scale LLMs in emotion recognition accuracy.
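The dual-weighted voting described above can be sketched in a few lines: each model's vote is scaled by its global credibility (e.g. its validation macro-F1) and by its local, per-instance confidence (its own softmax probability). This is a minimal illustration, assuming a multiplicative combination of the two weights; the paper's exact weighting formula may differ, and the model count, class count, and credibility values below are hypothetical.

```python
def dual_weighted_vote(probs, credibility):
    """Pick a class by credibility-and-confidence weighted voting.

    probs: per-model lists of class probabilities (softmax outputs)
    credibility: per-model global weights, e.g. validation macro-F1
    """
    n_classes = len(probs[0])
    scores = [0.0] * n_classes
    for model_probs, w in zip(probs, credibility):
        for c, p in enumerate(model_probs):
            # global credibility (w) times local confidence (p):
            # a reliable model that is sure about this instance votes strongest
            scores[c] += w * p
    return max(range(n_classes), key=scores.__getitem__)

# Hypothetical example: three models, four emotion classes
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.65, 0.15, 0.10, 0.10],
]
credibility = [0.92, 0.88, 0.90]  # assumed validation macro-F1 per model
print(dual_weighted_vote(probs, credibility))  # class 0 wins despite one dissenting model
```

Because the weights are applied per instance, a model that is usually weak (low credibility) can still dominate a vote when it is far more confident than its peers on a given input, which is what makes the mechanism dynamic rather than a fixed weighted average.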

📝 Abstract
This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based language models (sLLMs): BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our ensemble of small LLMs proves more parameter-efficient and robust than models of up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
Problem

Research questions and friction points this paper is trying to address.

Large LLMs are parameter-redundant for emotion detection, yet individual small models hit performance limits
How to weight each model's vote so that diverse small models combine synergistically
Whether a small-model ensemble can match or exceed 7B-scale LLMs with far fewer parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble of architecturally diverse small transformer models
Dual-weighted voting combining global credibility and instance-level confidence
Parameter-efficient design that outperforms much larger LLMs
Menna Elgabry
Department of Computer Science, MSA University, Giza, Egypt
Ali Hamdi
Computer Science, MSA University
Computer Vision · Deep Learning · Text Mining