🤖 AI Summary
This study investigates the robustness deficits of large language models (LLMs) in syllogistic reasoning where conclusions are logically valid but contradict human commonsense beliefs. To address the lack of evaluation resources, we introduce BIS Reasoning 1.0, the first large-scale, manually verified Japanese benchmark of belief-inconsistent syllogisms. The dataset is constructed systematically from formal logic so that logically valid conclusions conflict with common belief, enabling controlled assessment of how models resolve belief–logic conflicts. We evaluate state-of-the-art LLMs, including GPT-4o, Claude, and leading Japanese-language models, under a zero-shot, multi-model evaluation framework. Results show that while GPT-4o achieves the highest accuracy (79.54%), all models exhibit significant belief bias, frequently failing to prioritize logical validity over intuitive plausibility, even on tasks that explicitly demand truth-preserving inference. Our core contribution is a scalable, logic-grounded evaluation paradigm for belief–logic conflicts, enabling the first quantitative assessment of this capability in Japanese.
📝 Abstract
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models, including the GPT and Claude families and leading Japanese LLMs, and find significant variance in performance, with GPT-4o achieving the highest accuracy at 79.54%. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
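To make the task concrete, the sketch below illustrates what a belief-inconsistent item and a zero-shot accuracy check might look like. The example syllogism, prompt wording, and helper names (`build_prompt`, `accuracy`) are illustrative assumptions for exposition, not the paper's actual data or pipeline.

```python
# Hypothetical sketch of zero-shot evaluation on a belief-inconsistent
# syllogism: the inference is logically valid (Barbara form), but the
# content contradicts commonsense belief, so the gold label is "valid".
ITEMS = [
    {
        "premise1": "All mammals can fly.",
        "premise2": "All dogs are mammals.",
        "conclusion": "Therefore, all dogs can fly.",
        "gold": "valid",  # valid in form, belief-inconsistent in content
    },
]

def build_prompt(item: dict) -> str:
    """Zero-shot prompt asking only about logical validity, not real-world truth."""
    return (
        "Judge whether the conclusion follows logically from the premises, "
        "ignoring whether the statements are true in the real world.\n"
        f"Premise 1: {item['premise1']}\n"
        f"Premise 2: {item['premise2']}\n"
        f"Conclusion: {item['conclusion']}\n"
        "Answer 'valid' or 'invalid'."
    )

def accuracy(predictions: list[str], items: list[dict]) -> float:
    """Fraction of items where the model prioritized logic over belief."""
    correct = sum(p == it["gold"] for p, it in zip(predictions, items))
    return correct / len(items)

# A belief-biased model would answer "invalid" here, lowering accuracy.
preds = ["valid"]  # stand-in for actual model outputs
print(accuracy(preds, ITEMS))  # → 1.0
```

A belief-aligned control set would pair each such item with a content-matched syllogism whose valid conclusion is also believable; the accuracy gap between the two sets quantifies belief bias.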