Tursio Database Search: How far are we from ChatGPT?

📅 2026-03-19
🤖 AI Summary
This work addresses the difficulty enterprises face in querying structured databases accurately through natural language, a task that existing benchmarks evaluate poorly because they lack end-to-end assessment. Focusing on banking scenarios, the study introduces an end-to-end evaluation framework that generates realistic queries across multiple difficulty levels. It scores answers on relevance, safety, and conversational quality, using an LLM-as-judge approach for automated evaluation. Experimental results show that Tursio achieves answer relevancy statistically comparable to ChatGPT and Perplexity (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, and 89.5% vs. 100.0% on hard questions), even though it answers from a structured database rather than the open web. This suggests that specialized systems for structured data retrieval are feasible, and the analysis identifies database completeness as the primary bottleneck.

📝 Abstract
Business users need to search enterprise databases using natural language, just as they now search the web using ChatGPT or Perplexity. However, existing benchmarks -- designed for open-domain QA or text-to-SQL -- do not evaluate the end-to-end quality of such a search experience. We present an evaluation framework for structured database search that generates realistic banking queries across varying difficulty levels and assesses answer quality using relevance, safety, and conversational metrics via an LLM-as-judge approach. We apply this framework to compare Tursio, a database search platform, against ChatGPT and Perplexity on a credit union banking schema. Our results show that Tursio achieves answer relevancy statistically comparable to both baselines (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, 89.5% vs. 100.0% on hard questions), even though Tursio answers from a structured database while the baselines generate responses from the open web. We analyze the failure modes, identify database completeness as the primary bottleneck, and outline directions for improving both the evaluation methodology and the systems under evaluation.
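The abstract reports answer relevancy per difficulty level as judged by an LLM. As a rough illustration of how such per-bucket percentages could be aggregated, here is a minimal sketch. It is not the authors' implementation: the `JudgedCase` record, the `relevancy_by_difficulty` helper, and the `toy_judge` stand-in are all hypothetical names, and the judge is a trivial placeholder where a real setup would call an LLM.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class JudgedCase:
    """One question/answer pair to be scored (hypothetical record layout)."""
    system: str      # label of the system that produced the answer
    difficulty: str  # "simple", "medium", or "hard"
    question: str
    answer: str


def relevancy_by_difficulty(
    cases: Iterable[JudgedCase],
    judge: Callable[[str, str], bool],
) -> Dict[Tuple[str, str], float]:
    """Return the percentage of answers judged relevant per (system, difficulty)."""
    # Each bucket holds [relevant_count, total_count].
    totals = defaultdict(lambda: [0, 0])
    for case in cases:
        bucket = totals[(case.system, case.difficulty)]
        if judge(case.question, case.answer):
            bucket[0] += 1
        bucket[1] += 1
    return {
        key: 100.0 * relevant / count
        for key, (relevant, count) in totals.items()
    }


def toy_judge(question: str, answer: str) -> bool:
    """Placeholder for an LLM judge: any non-empty answer counts as relevant."""
    return bool(answer.strip())
```

Calling `relevancy_by_difficulty(cases, toy_judge)` on a list of `JudgedCase` records yields a dictionary keyed by `(system, difficulty)`, which mirrors the simple/medium/hard breakdown quoted in the abstract. Whether the reported differences are statistically comparable would additionally depend on the number of questions per bucket, which the abstract does not state.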
Problem

Research questions and friction points this paper is trying to address.

natural language database search
structured data querying
LLM-as-judge evaluation
enterprise database search
text-to-SQL
Innovation

Methods, ideas, or system contributions that make the work stand out.

database search
natural language interface
LLM-as-judge
evaluation framework
text-to-SQL
Sulbha Jain
Independent Consultant
Shivani Tripathi
Tursio
Shi Qiao
Tursio
Alekh Jindal
CEO and Co-founder, Tursio Inc.
Database Systems, Information Systems, Cloud Computing