Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing physical commonsense reasoning datasets (e.g., PIQA) are predominantly English-centric and lack cultural diversity, particularly Korean cultural context. Method: We introduce Ko-PIQA, the first Korean-language physical commonsense reasoning dataset explicitly grounded in Korean culture; 19.7% of its questions involve culturally specific elements (e.g., kimchi, hanbok, kimchi refrigerators) that require culture-aware reasoning. A multi-stage curation pipeline combines initial filtering by three language models, refinement via GPT-4o, and rigorous human validation, yielding 441 high-quality QA pairs. Contribution/Results: Experiments with seven state-of-the-art language models reveal substantial performance variation (59.86%–83.22% accuracy), underscoring the critical role of cultural sensitivity in physical commonsense reasoning. This work pioneers the systematic integration of cultural context into Korean-language physical commonsense benchmarking, advancing inclusive, cross-cultural commonsense reasoning research.

📝 Abstract
Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements like traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA, with the best model achieving 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.
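The multi-stage filtering described above (three language models narrowing 3.01M candidates to PIQA-style questions) can be sketched as a majority vote over per-model judges. This is a minimal illustration, not the paper's implementation: the judge function here is a toy heuristic standing in for an actual LM prompt, and all names are hypothetical.

```python
from typing import Callable, List


def is_piqa_style_stub(question: str) -> bool:
    """Hypothetical stand-in for one LM judge.

    A real pipeline would prompt a language model to decide whether the
    candidate is a physical "how-to" question; here we use a toy cue check.
    """
    physical_cues = ("how", "어떻게", "방법")  # illustrative cues only
    return any(cue in question.lower() for cue in physical_cues)


def majority_filter(
    questions: List[str], judges: List[Callable[[str], bool]]
) -> List[str]:
    """Keep a question only if a strict majority of judges accept it."""
    kept = []
    for q in questions:
        votes = sum(judge(q) for judge in judges)
        if votes > len(judges) // 2:
            kept.append(q)
    return kept


# The paper uses three distinct LMs; we reuse one stub three times.
judges = [is_piqa_style_stub] * 3
candidates = [
    "How do you keep kimchi fresh for months?",
    "My favorite singer released a new album.",
]
filtered = majority_filter(candidates, judges)  # only the how-to survives
```

Surviving candidates would then go through GPT-4o refinement and human validation, which this sketch does not model.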
Problem

Research questions and friction points this paper is trying to address.

Creating a Korean physical commonsense reasoning dataset
Addressing cultural diversity gaps in existing benchmarks
Evaluating models on culturally specific Korean scenarios
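The evaluation setup implied above (reporting 59.86%–83.22% accuracy across seven models) reduces to two-choice accuracy scoring on PIQA-style items. A minimal sketch, assuming each item holds a goal, two candidate solutions, and a gold label; the item contents and the toy length-baseline "model" are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    goal: str
    sol1: str
    sol2: str
    label: int  # 0 or 1: index of the correct solution


def accuracy(items: List[Item], choose: Callable[[Item], int]) -> float:
    """Fraction of items where the chosen index matches the gold label."""
    correct = sum(choose(it) == it.label for it in items)
    return correct / len(items)


def longest_solution(item: Item) -> int:
    """Naive baseline: always pick the longer candidate solution."""
    return 0 if len(item.sol1) >= len(item.sol2) else 1


# Hypothetical Ko-PIQA-style items (English glosses for readability).
items = [
    Item("Keep kimchi fresh", "Store it in a kimchi refrigerator",
         "Leave it on the counter", 0),
    Item("Press a hanbok", "Use a low-heat iron with a pressing cloth",
         "Use maximum heat directly", 0),
]
acc = accuracy(items, longest_solution)
```

In practice, `choose` would wrap a language model (e.g., scoring each solution's likelihood given the goal), and accuracy would be computed over all 441 Ko-PIQA pairs.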
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage filtering using three language models
GPT-4o refinement for question-answer pairs
Cultural context integration with Korean elements