HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation

📅 2025-04-15
🤖 AI Summary
This work addresses the lack of systematic evaluation of large language models (LLMs) for scientific hypothesis generation. The authors introduce HypoBench, the first dedicated benchmark for the task, comprising 7 real-world and 5 controllable synthetic tasks across 194 datasets, and quantitatively assess hypotheses along three dimensions: practicality, generalizability, and discovery rate. Methodologically, the paper proposes a multidimensional, interpretable framework for hypothesis-quality evaluation and designs synthetic tasks with known ground-truth hypotheses, which establish a theoretical upper bound on discovery rate. Empirical analysis reveals a fundamental bottleneck: state-of-the-art LLMs, combined with six mainstream approaches (including RAG and chain-of-thought), recover only 38.8% of the ground-truth hypotheses under the most difficult conditions. HypoBench is open-sourced alongside standardized evaluation protocols, providing reproducible, comparable, and extensible assessment infrastructure for AI-driven scientific discovery.
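The discovery-rate metric mentioned above can be illustrated with a minimal sketch: the fraction of ground-truth hypotheses that are recovered by at least one generated hypothesis, given some matching criterion. The function name, the toy hypotheses, and the exact-string matcher below are all illustrative assumptions; the benchmark itself presumably uses a more sophisticated semantic matching procedure.

```python
def discovery_rate(generated, ground_truth, matches):
    """Fraction of ground-truth hypotheses recovered by at least one
    generated hypothesis, as judged by a caller-supplied matcher."""
    recovered = sum(
        any(matches(g, t) for g in generated) for t in ground_truth
    )
    return recovered / len(ground_truth)

# Toy example with exact string equality as a stand-in matcher.
truth = ["feature A predicts Y", "feature B predicts Y"]
gen = ["feature A predicts Y", "feature C predicts Y"]
rate = discovery_rate(gen, truth, lambda g, t: g == t)  # 0.5
```

Under this framing, the reported 38.8% result corresponds to a discovery rate of 0.388 on the hardest synthetic settings, well below the upper bound of 1.0 that the known ground truth makes attainable in principle.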

📝 Abstract
There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, performance drops significantly as task difficulty increases, with the best models and methods recovering only 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

How to systematically benchmark LLMs and methods for hypothesis generation
How to assess the practical utility and generalizability of generated hypotheses
Identifying gaps in current methods' ability to discover ground-truth hypotheses
Innovation

Methods, ideas, or system contributions that make the work stand out.

HypoBench: a dedicated benchmark for systematic evaluation of hypothesis generation
Evaluates four state-of-the-art LLMs combined with six existing methods
Covers 7 real-world and 5 synthetic tasks across 194 datasets
Haokun Liu
Vector Institute, University of Toronto
Natural Language Processing
Sicong Huang
Department of Computer Science, University of Toronto
Jingyu Hu
Department of Computer Science, University of Toronto
Yangqiaoyu Zhou
Department of Computer Science, University of Chicago
Chenhao Tan
University of Chicago
Human-centered AI · Communication & Intelligence · Scientific Discovery · AI alignment · AI governance