🤖 AI Summary
This work addresses the challenge of prediction inconsistency among high-accuracy models performing the same task, which undermines trustworthiness in high-stakes scenarios due to the Rashomon effect. To mitigate this multiplicity of predictions, the authors propose three complementary strategies—outlier correction, local bias rectification, and pairwise prediction reconciliation—that can be applied either jointly or independently to reduce inter-model disagreement. The harmonized ensemble predictions are then distilled via knowledge distillation into a single, interpretable model. Experimental results across multiple datasets demonstrate that the proposed approach significantly lowers prediction inconsistency metrics while maintaining competitive accuracy, thereby achieving a favorable balance among reliability, consistency, and interpretability.
📝 Abstract
The existence of multiple, equally accurate models for a given predictive task leads to predictive multiplicity: a ``Rashomon set'' of models achieve similar accuracy but diverge in their individual predictions. This inconsistency undermines trust in high-stakes applications, where consistent predictions are expected. We propose three approaches to reduce inconsistency among the predictions of members of the Rashomon set. The first approach is \textbf{outlier correction}. An outlier is a point whose label none of the good models can predict correctly. Outliers can cause the Rashomon set to produce high-variance predictions in a local area, so fixing them lowers variance. Our second approach is \textbf{local patching}. In a local region around a test point, models may disagree with one another because some of them are biased; we detect and fix such biases using a validation set, which also reduces multiplicity. Our third approach is \textbf{pairwise reconciliation}: we find pairs of models that disagree in a region around the test point and modify the disagreeing predictions to make them less biased. These three approaches can be used together or separately, and each has distinct advantages. The reconciled predictions can then be distilled into a single interpretable model for real-world deployment. In experiments across multiple datasets, our methods reduce disagreement metrics while maintaining competitive accuracy.
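To make the disagreement notion concrete, here is a minimal sketch of measuring pairwise prediction inconsistency across a Rashomon set and reducing it with a simple majority-vote reconciliation. The function names and the majority-vote rule are illustrative assumptions, not the paper's actual algorithms, which reconcile predictions locally and correct biases rather than voting globally.

```python
import numpy as np

def pairwise_disagreement(preds):
    """Mean fraction of test points on which each pair of models disagrees.

    preds: (n_models, n_points) array of hard labels.
    """
    preds = np.asarray(preds)
    m = preds.shape[0]
    rates = [np.mean(preds[i] != preds[j])
             for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(rates))

def majority_reconcile(preds):
    """Replace each model's predictions with the ensemble majority label.

    A crude stand-in for the paper's reconciliation strategies: after this,
    all models agree everywhere, so pairwise disagreement drops to zero.
    Assumes binary labels in {0, 1}.
    """
    preds = np.asarray(preds)
    majority = (preds.mean(axis=0) >= 0.5).astype(preds.dtype)
    return np.tile(majority, (preds.shape[0], 1))

# Three equally plausible "Rashomon" models that differ on a few points.
preds = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
])
before = pairwise_disagreement(preds)                      # > 0
after = pairwise_disagreement(majority_reconcile(preds))   # 0.0
```

In practice the reconciled labels would then serve as soft targets for distilling a single interpretable model, as the abstract describes.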