🤖 AI Summary
To address robustness issues in large language model (LLM) ensembling caused by erroneous signals, which arise from tokenization heterogeneity and disparities in model capability, this paper proposes CoRE, a Consistency-aware Robust Ensembling framework. CoRE introduces dual-granularity consistency modeling at both the token and model levels: it employs low-pass filtering to mitigate token-alignment bias, and jointly applies confidence-weighted aggregation with output-convergence constraints to minimize inter-model disagreement, enabling plug-and-play robust ensembling. Its core innovation lies in unifying cross-level consistency modeling while decoupling the handling of token misalignment and capability imbalance. Extensive experiments across multiple benchmarks, diverse LLM combinations, and various ensembling strategies demonstrate that CoRE consistently and significantly improves both accuracy and robustness, with stable gains across settings.
📝 Abstract
Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensembling is a promising way to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise at both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensembling and can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often caused by token misalignment, thereby improving robustness at a granular level. Model-level consistency captures global agreement by promoting model outputs with high self-confidence and minimal divergence from the others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.
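To make the two consistency levels concrete, here is a minimal illustrative sketch of consistency-weighted fusion over per-model next-token distributions. This is not the paper's actual formulation: the function name `core_style_fusion`, the `alpha`/`beta` hyperparameters, the use of KL-to-the-mean as the divergence term, and the variance-based attenuation standing in for the "low-pass filter" are all assumptions made for illustration, and the sketch assumes the models' vocabularies are already aligned (alignment is a separate problem).

```python
import numpy as np

def core_style_fusion(probs, alpha=1.0, beta=1.0):
    """Illustrative consistency-weighted fusion (NOT the paper's method).

    probs: array of shape (n_models, vocab_size); each row is one ensemble
    member's next-token distribution over a shared, pre-aligned vocabulary.
    """
    eps = 1e-12
    mean = probs.mean(axis=0)  # ensemble-average distribution

    # Model-level consistency: reward self-confidence (peakedness of the
    # distribution) and penalize divergence from the ensemble mean.
    confidence = probs.max(axis=1)                                    # (n_models,)
    divergence = (probs * np.log((probs + eps) / (mean + eps))).sum(axis=1)
    scores = confidence - beta * divergence
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                          # softmax over models

    fused = (weights[:, None] * probs).sum(axis=0)

    # Token-level consistency: attenuate tokens whose probabilities vary
    # strongly across models (high-frequency disagreement, e.g. caused by
    # token misalignment) -- a crude stand-in for a low-pass filter.
    variance = probs.var(axis=0)
    lowpass = 1.0 / (1.0 + alpha * variance / (variance.mean() + eps))
    fused = fused * lowpass

    return fused / fused.sum()  # renormalize to a valid distribution
```

Under this toy scheme, a model that disagrees sharply with the others receives a smaller fusion weight, and tokens on which the models disagree are damped before the final renormalization, mirroring the coarse- and fine-grained consistency signals described above.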