Benchmarking Bias Mitigation Toward Fairness Without Harm: From Vision to LVLMs

📅 2026-02-03
🤖 AI Summary
This work addresses the lack of standardized evaluation for bias mitigation methods, which impedes fair comparisons and can compromise model utility. To this end, we propose NH-Fair, the first unified cross-modal benchmark that jointly evaluates fairness and harmlessness for both vision models and large vision-language models (LVLMs) under consistent data, metrics, and training protocols across supervised and zero-shot settings. Through systematic hyperparameter tuning, composite data augmentation, and multi-architecture comparisons, we find that well-tuned empirical risk minimization (ERM) baselines frequently outperform sophisticated debiasing approaches, and that while LVLMs achieve higher overall accuracy, they still exhibit significant subgroup biases. Our study elucidates the impact of key training factors on fairness and provides a reproducible, tuning-aware evaluation framework.

📝 Abstract
Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing their effectiveness remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines that help practitioners shrink the expensive hyperparameter search space while still achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy; and (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.
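To make the notion of "subgroup disparities" in the abstract concrete, the sketch below computes per-group accuracy and a worst-group accuracy gap. This is a common fairness measure, used here purely as an illustrative assumption; it is not claimed to be NH-Fair's exact metric, and all data in the example is a toy placeholder.

```python
# Hedged sketch: per-subgroup accuracy and the gap between the best- and
# worst-performing groups, one common way to quantify subgroup disparity.
from collections import defaultdict

def subgroup_accuracies(preds, labels, groups):
    """Accuracy computed separately for each subgroup label (e.g. a
    demographic attribute attached to every example)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}

def worst_group_gap(preds, labels, groups):
    """Difference between the highest and lowest subgroup accuracy;
    0.0 means perfect parity across groups."""
    accs = subgroup_accuracies(preds, labels, groups)
    return max(accs.values()) - min(accs.values())

# Toy example: the model is perfect on group "a" but weak on group "b".
preds  = [1, 1, 0, 1, 0, 0]
labels = [1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]
print(subgroup_accuracies(preds, labels, groups))  # {'a': 1.0, 'b': 0.333...}
print(worst_group_gap(preds, labels, groups))      # 0.666...
```

A "fairness without harm" evaluation in this spirit would track a disparity measure like the gap above jointly with overall accuracy, so that parity gains obtained by degrading every group are not rewarded.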
Problem

Research questions and friction points this paper is trying to address.

bias mitigation
fairness
vision models
large vision-language models
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

bias mitigation
fairness benchmark
large vision-language models
data augmentation
hyperparameter tuning
Xuwei Tan
The Ohio State University
Ziyu Hu
Stevens Institute of Technology
Xueru Zhang
Assistant Professor, Computer Science and Engineering, The Ohio State University
responsible machine learning