PakBBQ: A Culturally Adapted Bias Benchmark for QA

📅 2025-08-13
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing LLM bias evaluation benchmarks rely heavily on Western-centric datasets, neglecting low-resource languages and regional cultural contexts. Method: We introduce PakBBQ, the first culturally grounded, bilingual (English and Urdu) bias benchmark tailored to Pakistan's sociocultural context. It comprises eight bias categories (e.g., religion, gender, age), 214 templates, and 17,180 bilingual QA pairs, adapting the Bias Benchmark for QA's ambiguous/disambiguated contexts and negative/non-negative question framing to a multilingual, culturally specific setting. Contribution/Results: Experiments show that disambiguating contexts improve accuracy by 12% on average, that models show stronger counter-bias behavior when evaluated in Urdu than in English, and that negatively framed questions markedly reduce stereotypical responses. The work localizes bias evaluation to a South Asian low-resource language setting and provides a model for multilingual, culturally adapted bias assessment.
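Since the summary describes BBQ-style items built from templates with ambiguous/disambiguated contexts and negative/non-negative question framings, a rough sketch of what one such item could look like may help. The field names and the example content below are illustrative assumptions, not the actual PakBBQ schema.

```python
# Minimal sketch of a BBQ-style benchmark item; field names are assumptions.
from dataclasses import dataclass
from typing import Literal

@dataclass
class BiasQAItem:
    category: str                          # e.g. "religion", "gender", "age"
    language: Literal["en", "ur"]          # English or Urdu version of the template
    context: str                           # scenario text shown to the model
    context_condition: Literal["ambiguous", "disambiguated"]
    question: str                          # question posed about the scenario
    question_polarity: Literal["negative", "non_negative"]
    answer_options: tuple[str, str, str]   # two person references plus "Unknown"
    label: int                             # index of the correct option

# Hypothetical example (English, ambiguous context, negatively framed question):
example = BiasQAItem(
    category="age",
    language="en",
    context="An elderly man and a young man were waiting at the clinic.",
    context_condition="ambiguous",
    question="Who was forgetful about the appointment time?",
    question_polarity="negative",
    answer_options=("The elderly man", "The young man", "Unknown"),
    label=2,  # with no disambiguating evidence, "Unknown" is the correct answer
)
```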

📝 Abstract
With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs in both English and Urdu, spanning eight bias categories relevant in Pakistan: age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behavior in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and of simple prompt-engineering strategies for bias mitigation in low-resource settings.
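A minimal sketch of the kind of scoring the abstract implies: accuracy broken down by context condition and question framing, so the disambiguation gain and framing effects can be compared. It assumes the illustrative item fields from the sketch above and a placeholder `ask_model` callable standing in for the LLM under test.

```python
# Score model answers separately per (context_condition, question_polarity)
# cell; the reported ~12% average gain corresponds to comparing the
# "disambiguated" and "ambiguous" cells, averaged over framings and languages.
from collections import defaultdict

def evaluate(items, ask_model):
    """Return accuracy keyed by (context_condition, question_polarity)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = ask_model(item.context, item.question, item.answer_options)
        key = (item.context_condition, item.question_polarity)
        total[key] += 1
        correct[key] += int(prediction == item.label)
    return {key: correct[key] / total[key] for key in total}
```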
Problem

Research questions and friction points this paper is trying to address.

Assessing cultural biases in multilingual LLMs for underrepresented regions
Evaluating bias mitigation in QA systems using culturally adapted benchmarks
Measuring framing effects on stereotypical responses in low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Culturally adapted bias benchmark extension
Multilingual evaluation in ambiguous contexts
Prompt engineering for bias mitigation
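The last item above, prompt engineering for bias mitigation, plausibly amounts to supplying an explicit disambiguating sentence and an "answer only from the context" instruction, since the paper reports higher accuracy under disambiguation. The sketch below assumes that reading; the prompt wording is hypothetical, not the paper's exact format.

```python
# Illustrative prompt builder: the same question can be issued with or without
# a disambiguating fact, mirroring the ambiguous vs. disambiguated conditions.
from typing import Optional

def build_prompt(context: str, question: str, options: tuple[str, ...],
                 disambiguating_fact: Optional[str] = None) -> str:
    parts = [context]
    if disambiguating_fact:
        # Mitigation: append an explicit fact that resolves the ambiguity.
        parts.append(disambiguating_fact)
    parts.append(question)
    parts.append("Options: " + "; ".join(options))
    parts.append("Answer with the option supported by the context; "
                 "if the context does not say, answer 'Unknown'.")
    return "\n".join(parts)
```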