🤖 AI Summary
Existing multilingual large language model (LLM) debiasing methods—e.g., SentDebias—exhibit limited cross-lingual transferability and fail to consistently mitigate bias across languages.
Method: We propose performing debiasing within a unified, semantically aligned cross-lingual latent space rather than operating directly on raw model representations. Specifically, we train a cross-lingual autoencoder on parallel TED corpora to construct this aligned latent space, and apply multiple debiasing techniques to Aya-Expanse within it.
Contribution/Results: Experiments across four languages demonstrate: (1) strong cross-lingual alignment in the learned latent space; (2) significantly improved debiasing performance over baselines; and (3) effective cross-lingual generalization of bias mitigation. To our knowledge, this is the first work to jointly optimize latent-space alignment and debiasing, thereby enhancing both generalizability and consistency in multilingual fairness modeling.
📝 Abstract
Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by applying these methods directly to LLM representations, revealing limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-Expanse and two debiasing techniques across four languages (English, French, German, and Dutch) demonstrate that (a) autoencoders effectively construct a well-aligned cross-lingual latent space, and (b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both overall debiasing performance and cross-lingual transferability.
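To make the core mechanism concrete, the following is a minimal sketch of SentDebias-style bias removal applied to latent vectors, as the abstract describes. Everything here is illustrative: the encoder is assumed (toy random latents stand in for autoencoder codes of counterfactual sentence pairs such as "he is a doctor" / "she is a doctor"), and the bias subspace is estimated, as in SentenceDebias, from the principal directions of the pair differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # latent dimensionality (hypothetical)

# Toy latent codes for counterfactual sentence pairs. In the paper's setup,
# these would come from the cross-lingual autoencoder's encoder; here they
# are synthetic, with a known bias axis injected along the first coordinate.
bias_true = np.zeros(d)
bias_true[0] = 1.0
z_a = rng.normal(size=(100, d))
z_b = z_a + rng.normal(1.0, 0.1, size=(100, 1)) * bias_true

# SentenceDebias-style estimation: the top principal directions of the
# centered pair differences span the bias subspace.
diffs = z_a - z_b
_, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
bias_basis = vt[:1]  # rank-1 bias subspace (k=1 here)

def debias(z):
    """Project latent vectors onto the orthogonal complement of the bias subspace."""
    return z - (z @ bias_basis.T) @ bias_basis

z_clean = debias(z_a)
# After projection, the latents have no component along the bias direction.
print(np.abs(z_clean @ bias_basis.T).max())  # ~0
```

The design choice the paper argues for is *where* this projection happens: applying it in the aligned latent space, rather than on raw per-language LLM representations, lets one estimated bias subspace transfer across languages.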