Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

📅 2025-11-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the alignment between large language models (LLMs) and human annotators in judging question difficulty within Japanese quiz-bowl-style question answering. Using a manually curated Japanese question-answering dataset, the authors employ multiple prompting strategies to elicit responses from LLMs and systematically compare their accuracy against human performance along two key dimensions: whether the answer is covered by Wikipedia and whether the answer is numeric. Results reveal that LLMs underperform humans on questions whose answers are not covered by Wikipedia and on those requiring numeric answers, highlighting their reliance on training-data coverage and their limitations in numerical reasoning. To the authors' knowledge, this is the first empirical study to uncover structural misalignment in difficulty perception between LLMs and humans in a Japanese quiz context. The work provides a reproducible analytical framework for probing model knowledge boundaries and informing robust prompt engineering.

📝 Abstract
LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and human correct-response rates, then prompt LLMs to answer the quizzes under several settings, and compare their correct-answer rates to those of humans from two analytical perspectives. The experimental results show that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.
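The comparison described above can be sketched as follows: group quiz items by the two analytical dimensions (answer covered by Wikipedia, answer numeric) and contrast the mean human correct-response rate with the LLM's accuracy in each group. This is a minimal illustrative sketch, not the paper's code; the field names and the toy records are hypothetical.

```python
# Hypothetical sketch of the paper's analysis: per-group comparison of
# human correct-response rates vs. LLM accuracy. All data is illustrative.
from statistics import mean

quizzes = [
    # in_wiki: answer has a Wikipedia entry; numeric: answer is a number
    {"in_wiki": True,  "numeric": False, "human_rate": 0.62, "llm_correct": True},
    {"in_wiki": False, "numeric": False, "human_rate": 0.55, "llm_correct": False},
    {"in_wiki": True,  "numeric": True,  "human_rate": 0.48, "llm_correct": False},
    {"in_wiki": False, "numeric": True,  "human_rate": 0.41, "llm_correct": False},
]

def rates_by(key):
    """Mean human rate and LLM accuracy for each value of a boolean key."""
    out = {}
    for value in (True, False):
        group = [q for q in quizzes if q[key] == value]
        out[value] = {
            "human": mean(q["human_rate"] for q in group),
            "llm": mean(1.0 if q["llm_correct"] else 0.0 for q in group),
        }
    return out

wiki_gap = rates_by("in_wiki")      # accuracy split by Wikipedia coverage
numeric_gap = rates_by("numeric")   # accuracy split by numeric answers
```

With real data, the finding in the abstract would show up as a larger human-minus-LLM gap in the `in_wiki=False` and `numeric=True` groups than in their complements.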
Problem

Research questions and friction points this paper is trying to address.

Comparing difficulty patterns between LLMs and humans in Japanese quiz answering
Investigating if human-difficult questions are also challenging for language models
Analyzing LLM performance on Wikipedia-independent and numerical answer questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing LLM performance on Japanese quiz data
Comparing human and LLM difficulty patterns systematically
Identifying Wikipedia coverage impact on LLM accuracy
Naoya Sugiura
Graduate School of Informatics, Nagoya University
Kosuke Yamada
CyberAgent Inc.
Yasuhiro Ogawa
Graduate School of Informatics, Nagoya University
Katsuhiko Toyama
Graduate School of Informatics, Nagoya University
Ryohei Sasano
Associate Professor at Nagoya University