🤖 AI Summary
Existing physical commonsense reasoning datasets (e.g., PIQA) are predominantly English-centric and lack cultural diversity, particularly Korean cultural context. Method: We introduce Ko-PIQA, a Korean-language physical commonsense reasoning dataset explicitly grounded in Korean culture; 19.7% of its questions involve culturally specific elements (e.g., kimchi, hanbok, kimchi refrigerators) that require culture-aware reasoning. A multi-stage curation pipeline combines initial filtering of 3.01 million web-crawled questions by three language models, refinement via GPT-4o, and rigorous human validation, yielding 441 high-quality QA pairs. Contribution/Results: Experiments with seven state-of-the-art language models reveal substantial performance variation (59.86%–83.22% accuracy), underscoring the role of cultural sensitivity in physical commonsense reasoning. This work systematically integrates cultural context into a Korean-language physical commonsense benchmark, advancing inclusive, cross-cultural commonsense reasoning research.
📝 Abstract
Physical commonsense reasoning datasets like PIQA are predominantly English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean physical commonsense reasoning dataset that incorporates cultural context. Starting from 3.01 million web-crawled questions, we employed a multi-stage filtering approach using three language models to identify 11,553 PIQA-style questions. Through GPT-4o refinement and human validation, we obtained 441 high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural grounding: 19.7% of questions contain culturally specific elements such as traditional Korean foods (kimchi), clothing (hanbok), and specialized appliances (kimchi refrigerators) that require culturally aware reasoning beyond direct translation. We evaluate seven language models on Ko-PIQA; the best model achieves 83.22% accuracy while the weakest reaches only 59.86%, demonstrating significant room for improvement. Models particularly struggle with culturally specific scenarios, highlighting the importance of culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean language models and a foundation for more inclusive commonsense reasoning research. The dataset and code will be publicly available.
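The abstract's multi-stage filtering step (three language models screening candidate questions) can be sketched as an ensemble filter. This is a minimal illustration under assumptions the paper does not specify: the aggregation rule (here a 2-of-3 majority vote) and the classifiers themselves (keyword stand-ins, not the actual models used) are hypothetical.

```python
# Sketch of ensemble filtering: keep a candidate question only if a
# majority of three classifiers judge it PIQA-style. The classifiers
# below are keyword heuristics standing in for real language models.

def majority_vote_filter(questions, classifiers, min_votes=2):
    """Keep questions accepted by at least `min_votes` classifiers."""
    kept = []
    for q in questions:
        votes = sum(1 for clf in classifiers if clf(q))
        if votes >= min_votes:
            kept.append(q)
    return kept

# Hypothetical candidate pool: two physical how-to questions and one
# factual (non-PIQA-style) question that the filter should drop.
candidates = [
    "How do I store kimchi so it stays fresh longer?",
    "What year did the dynasty end?",
    "How can I remove a stain from hanbok fabric?",
]

# Three stand-in "models", each a crude heuristic for "physical how-to".
classifiers = [
    lambda q: q.startswith("How"),          # imperative/how-to phrasing
    lambda q: q.endswith("?"),              # well-formed question
    lambda q: "did" not in q,               # not a historical-fact query
]

filtered = majority_vote_filter(candidates, classifiers)
```

In the actual pipeline this screening is followed by GPT-4o refinement and human validation; the sketch covers only the first filtering stage.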