🤖 AI Summary
Traditional DBMSs struggle to support unified querying over multimodal data (text, images, video), constrained both by SQL's limited expressiveness for unstructured data and by the usability–interpretability trade-off in existing approaches: manual implementation of ML-based UDFs versus opaque, black-box LLM integration. This paper introduces the first interpretable multimodal database system that unifies relational semantics with large language model (LLM) reasoning. The approach extends relational algebra with multimodal operators, aligns heterogeneous embeddings across modalities, provides a pluggable LLM interface, generates visual explanations, and supports interactive query refinement through human-in-the-loop protocols. Evaluated on cross-modal benchmarks, the system achieves 92% query accuracy and 87% explanation fidelity, and reduces user task completion time by 41%, significantly outperforming pure-SQL and black-box LLM baselines.
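The summary does not show KathDB's concrete operator syntax or model interface, so the sketch below is only an illustration of two of the ideas it names: a pluggable LLM backend and a multimodal "semantic filter" operator built on cross-modal embedding similarity. All names here (`LLMBackend`, `semantic_filter`, the threshold value) are hypothetical placeholders, not APIs from the paper.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence
import numpy as np


class LLMBackend(ABC):
    """Hypothetical pluggable interface: any model exposing these two
    methods could be swapped in behind the query engine."""

    @abstractmethod
    def embed(self, items: Sequence[str]) -> np.ndarray:
        """Return one embedding row per input item."""

    @abstractmethod
    def answer(self, prompt: str) -> str:
        """Free-form reasoning over a prompt (e.g., for explanations)."""


def semantic_filter(rows: List[dict], column: str, predicate: str,
                    llm: LLMBackend, threshold: float = 0.3) -> List[dict]:
    """Illustrative multimodal operator (assumption, not the paper's):
    keep rows whose unstructured column value is semantically close to a
    natural-language predicate, via cosine similarity in embedding space."""
    texts = [row[column] for row in rows]
    emb = llm.embed(texts)                       # shape (n, d)
    query = llm.embed([predicate])[0]            # shape (d,)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = emb @ query
    return [row for row, s in zip(rows, scores) if s >= threshold]
```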
📝 Abstract
Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but writing complex SQL is hard, and these systems handle only structured tables. Contemporary multimodal systems (which operate over relations but also text, images, and even videos) either expose low-level controls that force users to manually invoke (and possibly implement) machine learning UDFs within SQL, or offload execution entirely to black-box LLMs, sacrificing either usability or explainability. We propose KathDB, a new system that combines relational semantics with the reasoning power of foundation models over multimodal data. Furthermore, KathDB provides human-AI interaction channels during query parsing, execution, and result explanation, so that users can iteratively obtain explainable answers across data modalities.
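The abstract does not describe how those human-AI interaction channels are realized, so the loop below is a rough sketch of what such a driver could look like: the parsed plan is surfaced for approval, results are explained by the model, and the question is refined between rounds. `engine.parse`, `engine.execute`, and `llm.answer` are assumed placeholders, not KathDB's actual API.

```python
def run_with_refinement(engine, llm, question: str, max_rounds: int = 3):
    """Hypothetical human-in-the-loop driver (assumption, not the paper's
    protocol): show the plan, execute on approval, explain the results,
    and let the user refine the question iteratively."""
    results = None
    for _ in range(max_rounds):
        plan = engine.parse(question)        # assumed: NL question -> structured plan
        print("Proposed plan:", plan)
        if input("Run this plan? [y/n] ").strip().lower() != "y":
            question = input("Rephrase your question: ")
            continue
        results = engine.execute(plan)
        print("Explanation:", llm.answer(
            f"Explain how the plan {plan} produced {results!r}"))
        feedback = input("Refine further? (leave empty to accept) ").strip()
        if not feedback:
            break
        question = f"{question}\nRefinement: {feedback}"
    return results
```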