🤖 AI Summary
To address parameter redundancy in large models and the performance limitations of small models for emotion detection, this paper proposes a lightweight heterogeneous ensemble framework comprising five compact language models: BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA. Grounded in the Condorcet Jury Theorem, it introduces a novel dual-weighted dynamic voting mechanism: global reliability is calibrated from validation-set performance, while instance-level confidence is derived from fine-grained probabilistic modeling. This design preserves error diversity across base models while enabling synergistic performance gains. Experiments on the DAIR-AI benchmark yield a macro F1 score of 93.5%, substantially outperforming LoRA-fine-tuned large models, including Falcon-7B, Mistral-7B, Qwen-7B, and Phi-3, despite using only 595M total parameters. This represents over an 11× improvement in parameter efficiency and constitutes the first empirical demonstration that carefully designed small-model ensembles can surpass mainstream LLMs of up to 7B parameters in emotion recognition accuracy.
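For reference, the Condorcet Jury Theorem that motivates the ensemble can be stated in its textbook form (the paper's exact formulation and assumptions may differ): with an odd number n of independent voters, each correct with probability p, the majority vote is correct with probability

```latex
P_{\mathrm{maj}}(n,p) \;=\; \sum_{k=\frac{n+1}{2}}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k},
\qquad p > \tfrac{1}{2} \;\Longrightarrow\; P_{\mathrm{maj}}(n,p) \to 1 \ \text{as}\ n \to \infty .
```

In this reading, the five fine-tuned models play the role of jurors, and the dual weights refine the plain majority vote to account for jurors whose reliability is unequal and instance-dependent.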
📝 Abstract
This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs): BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while leveraging the distinct inductive biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5%, surpassing state-of-the-art results and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our ensemble of small LLMs proves more parameter-efficient and robust than models with up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
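A minimal sketch of the dual-weighted voting rule described above is given below. It assumes each model's global credibility is its validation macro F1, its local confidence is the softmax probability of its predicted class, and that each model casts a weighted hard vote; the paper's actual aggregation (e.g., weighting full probability vectors) may differ, and the function and variable names are illustrative only.

```python
import numpy as np

def dual_weighted_vote(probs_per_model, credibility):
    """Combine per-model class probabilities using credibility-confidence weights.

    probs_per_model: array of shape (n_models, n_classes), each model's softmax
        output for one input text (source of instance-level confidence).
    credibility: array of shape (n_models,), each model's validation macro F1
        (global credibility).
    Returns the index of the winning emotion class.
    """
    probs_per_model = np.asarray(probs_per_model, dtype=float)
    credibility = np.asarray(credibility, dtype=float)

    # Local confidence: each model's probability for its own predicted class.
    confidence = probs_per_model.max(axis=1)          # shape (n_models,)

    # Dual weight: global credibility times instance-level confidence.
    weights = credibility * confidence                # shape (n_models,)

    # Each model casts its weight for the class it predicts (weighted hard vote).
    votes = np.zeros(probs_per_model.shape[1])
    for w, p in zip(weights, probs_per_model):
        votes[np.argmax(p)] += w
    return int(np.argmax(votes))

# Toy usage: 5 models, 6 emotion classes (DAIR-AI: sadness, joy, love, anger, fear, surprise).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(6), size=5)             # stand-in softmax outputs
cred = np.array([0.91, 0.93, 0.89, 0.92, 0.90])       # stand-in validation macro F1 scores
print(dual_weighted_vote(probs, cred))
```

Multiplying the two factors lets a generally strong model be down-weighted on inputs where it is unsure, and a weaker model contribute more when it is highly confident, which is the intended effect of combining global credibility with local confidence.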