🤖 AI Summary
Multilingual text-to-image generation models inherit the gender biases of their monolingual counterparts and can amplify them, with significant inconsistency in how that amplification manifests across languages; existing neutral prompt engineering strategies fail to mitigate bias while preserving text-to-image alignment.

Method: We introduce MAGBIG, the first dedicated benchmark for evaluating gender bias in multilingual text-to-image generation, covering occupation- and trait-based prompts to systematically quantify cross-lingual bias. Through multilingual prompt construction, cross-lingual bias analysis, text-embedding and image-attribute statistics, and comparative prompt engineering experiments on models including Stable Diffusion, we empirically assess language-dependent bias.

Contribution/Results: We provide the first empirical evidence that multilingual capability does not alleviate gender stereotyping but rather intensifies it. Neutral prompting degrades text-to-image alignment by 12.7%, revealing fundamental limitations of current debiasing approaches. MAGBIG enables rigorous cross-lingual bias evaluation and highlights the urgent need for language-aware fairness interventions.
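As a rough illustration of the prompt-construction and generation step described above, the sketch below builds per-language occupation prompts and samples images with a Stable Diffusion pipeline via the public diffusers API. The templates, occupation lists, and model id are illustrative stand-ins, not MAGBIG's actual prompt sets or the multilingual checkpoints evaluated in the paper.

```python
# A minimal sketch of multilingual prompt construction and image generation,
# assuming the public diffusers API. Templates, occupations, and the model id
# are illustrative placeholders, not MAGBIG's actual prompt sets.
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical per-language templates; {} is filled with an occupation noun.
TEMPLATES = {
    "en": "a photo of the face of a {}",
    "de": "ein Foto des Gesichts von einem {}",
    "es": "una foto de la cara de un {}",
}
# Occupations aligned by index across languages so results can be compared.
OCCUPATIONS = {
    "en": ["doctor", "nurse", "engineer"],
    "de": ["Arzt", "Krankenpfleger", "Ingenieur"],
    "es": ["médico", "enfermero", "ingeniero"],
}

# A multilingual-capable checkpoint would be substituted here; this id is a
# placeholder for whichever Stable Diffusion variant is under evaluation.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

images = {}  # (language, occupation index) -> list of generated PIL images
for lang, template in TEMPLATES.items():
    for idx, occupation in enumerate(OCCUPATIONS[lang]):
        prompt = template.format(occupation)
        # Sample several images per prompt so per-prompt gender
        # proportions can be estimated downstream.
        images[(lang, idx)] = pipe(prompt, num_images_per_prompt=4).images
```

Sampling multiple images per prompt matters here: a single generation per prompt cannot distinguish a biased model from an unlucky draw.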
📝 Abstract
Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases, but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.
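To make the cross-lingual evaluation concrete, here is a minimal sketch of how per-prompt gender skew and its spread across languages could be computed from the generated images above. The perceived-gender classifier is a hypothetical placeholder, and the statistics are an assumed shape of such an evaluation, not the paper's exact protocol.

```python
# A minimal sketch of a gender-skew statistic over generated images, keyed as
# in the generation sketch above. perceived_gender is a hypothetical stand-in
# for an external face-attribute classifier; this is an assumed evaluation
# shape, not MAGBIG's exact protocol.
from collections import defaultdict

def perceived_gender(image) -> str:
    """Stand-in for an external classifier returning the *perceived*
    gender ('female' or 'male') of the depicted face."""
    raise NotImplementedError  # plug in a classifier of your choice

def gender_skew(images_by_key):
    """Map each (language, occupation index) to |p_female - 0.5|:
    0.0 means gender parity, 0.5 means fully one-sided generations."""
    skew = {}
    for key, imgs in images_by_key.items():
        labels = [perceived_gender(img) for img in imgs]
        skew[key] = abs(labels.count("female") / len(labels) - 0.5)
    return skew

def cross_lingual_gap(skew):
    """Spread of skew across languages for the same occupation index;
    a large gap indicates language-dependent bias."""
    by_occupation = defaultdict(list)
    for (lang, idx), s in skew.items():
        by_occupation[idx].append(s)
    return {idx: max(v) - min(v) for idx, v in by_occupation.items()}
```

Under this framing, the paper's two headline findings correspond to two readings of the same numbers: high skew values indicate gender bias per se, while large cross-lingual gaps indicate that the bias is inconsistent across languages.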