🤖 AI Summary
Existing question-answering systems struggle in real-world scenarios where evidence is scattered across massive, heterogeneous data lakes, and comprehensive benchmarks that jointly evaluate retrieval and multi-hop reasoning are lacking. To address this gap, this work proposes LakeQA, the first search-centric QA benchmark to combine tens of millions of structured and unstructured records (approximately 9.5 TB from Wikipedia and government open data), expert annotations, and implicit multi-hop reasoning. LakeQA requires agents to retrieve and synthesize evidence across diverse sources to answer questions, establishing a new evaluation standard for complex retrieval-reasoning tasks. Experiments on seven state-of-the-art large language models show that the benchmark is difficult: GPT-5.2, for example, achieves an exact-match rate of only 18.37%, which indicates that LakeQA effectively probes models' combined retrieval and reasoning capabilities.
📝 Abstract
Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. Instead, the useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves an exact-match score of only 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.
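
The abstract reports an exact-match score but does not say how it is computed. As a rough illustration only, the sketch below shows one common way such a score is calculated, using SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). The normalization rules and the function names `normalize`, `exact_match`, and `exact_match_rate` are assumptions for illustration, not LakeQA's actual scoring code.

```python
import re
import string


def normalize(text: str) -> str:
    """SQuAD-style normalization (an assumption, not LakeQA's documented rule)."""
    text = text.lower()
    # Drop ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Drop English articles that appear as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    normalized = normalize(prediction)
    return any(normalized == normalize(gold) for gold in gold_answers)


def exact_match_rate(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition a prediction earns credit only when the whole normalized string matches a gold answer, so an answer that is nearly right scores the same as one that is entirely wrong. That strictness is part of why exact-match scores on multi-hop tasks tend to be low.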