Tursio Database Search: How far are we from ChatGPT?

📅 2026-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge enterprises face in querying structured databases accurately and efficiently via natural language, a task that existing benchmarks evaluate poorly because they lack end-to-end assessment. Focusing on banking scenarios, the study introduces an end-to-end evaluation framework that generates realistic queries across multiple difficulty levels. It proposes a multidimensional metric system that combines relevance, safety, and conversational metrics, using an LLM-as-judge approach for automated evaluation. Experimental results show that Tursio achieves answer relevancy statistically comparable to ChatGPT and Perplexity (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, and 89.5% vs. 100.0% on hard questions), supporting the feasibility of specialized systems for structured data retrieval and identifying database completeness as the primary bottleneck.

📝 Abstract

Business users need to search enterprise databases using natural language, just as they now search the web using ChatGPT or Perplexity. However, existing benchmarks -- designed for open-domain QA or text-to-SQL -- do not evaluate the end-to-end quality of such a search experience. We present an evaluation framework for structured database search that generates realistic banking queries across varying difficulty levels and assesses answer quality using relevance, safety, and conversational metrics via an LLM-as-judge approach. We apply this framework to compare Tursio, a database search platform, against ChatGPT and Perplexity on a credit union banking schema. Our results show that Tursio achieves answer relevancy statistically comparable to both baselines (97.8% vs. 98.1% on simple, 90.0% vs. 100.0% on medium, 89.5% vs. 100.0% on hard questions), even though Tursio answers from a structured database while the baselines generate responses from the open web. We analyze the failure modes, identify database completeness as the primary bottleneck, and outline directions for improving both the evaluation methodology and the systems under evaluation.
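
The per-difficulty relevancy figures quoted above can be read as the share of answers a judge marks relevant within each difficulty level. The following is a minimal sketch of that aggregation, not the paper's implementation: the `Sample` type, the `relevancy_by_difficulty` function, and the `toy_judge` stand-in are all hypothetical names introduced here, and a real setup would replace `toy_judge` with a call to an LLM judge.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Sample(NamedTuple):
    """One evaluated question/answer pair (hypothetical structure)."""
    difficulty: str  # e.g. "simple", "medium", "hard"
    question: str
    answer: str


def relevancy_by_difficulty(
    samples: Iterable[Sample],
    judge: Callable[[str, str], bool],
) -> dict[str, float]:
    """Return the share of answers the judge marks relevant, per difficulty."""
    relevant = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s.difficulty] += 1
        if judge(s.question, s.answer):
            relevant[s.difficulty] += 1
    return {level: relevant[level] / total[level] for level in total}


def toy_judge(question: str, answer: str) -> bool:
    # Stand-in for an LLM judge (assumption): any non-empty answer counts
    # as relevant. A real judge would prompt a model with both strings.
    return bool(answer.strip())
```

Passing the judge in as a callable keeps the aggregation independent of any particular model or prompt, so the same function could score each system under comparison.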

Problem

Research questions and friction points this paper is trying to address.

natural language database search

structured data querying

LLM-as-judge evaluation

enterprise database search

text-to-SQL

Innovation

Methods, ideas, or system contributions that make the work stand out.

database search

natural language interface

LLM-as-judge