CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval

📅 2026-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of general-purpose embedding models in table retrieval, which stem from structural semantic compression and mismatches between queries and tabular content. To overcome these challenges, the authors propose a clustering-guided partial table construction approach that leverages K-means clustering and large language models (LLMs) to generate semantically diverse synthetic queries, thereby creating high-quality supervision signals. The embedding model is then fine-tuned using contrastive learning with hard negative examples. This strategy substantially enhances semantic coverage and cross-domain generalization, achieving an average 16.54% improvement in Recall@1 across four public benchmarks and outperforming existing methods. Notably, the approach remains effective even with smaller-scale LLMs and a unified multi-domain corpus.
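The clustering-guided partial table construction described above can be sketched roughly as follows: cluster the table's rows by their embeddings with K-means, then sample rows across clusters so the partial table spans the table's semantic range instead of one contiguous block. This is a minimal illustrative sketch, not the authors' implementation; the function names, the farthest-point initialization, and the one-row-per-cluster sampling are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means returning a cluster label per row of X.
    Farthest-point initialisation keeps the initial centers spread out."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[d.argmax()])            # farthest remaining point
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each row to its nearest center, then recompute centers
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(0)
    return labels

def build_partial_table(rows, row_embeddings, k, rows_per_cluster=1, seed=0):
    """Sample rows across K-means clusters to build a semantically
    diverse partial table (illustrative stand-in for CGPT's selection)."""
    rng = np.random.default_rng(seed)
    labels = kmeans(np.asarray(row_embeddings, dtype=float), k, seed=seed)
    picked = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if len(idx):
            picked.extend(rng.choice(idx, size=min(rows_per_cluster, len(idx)),
                                     replace=False))
    return [rows[i] for i in sorted(picked)]
```

In the full pipeline, each such partial table would then be passed to an LLM to generate synthetic queries that serve as supervision.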

📝 Abstract
General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at https://github.com/yumeow0122/CGPT.
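The hard-negative contrastive fine-tuning step mentioned in the abstract can be illustrated with an InfoNCE-style objective: a synthetic query is pulled toward the partial table it was generated from and pushed away from embeddings of similar but incorrect tables. The sketch below is a single-example numpy version for intuition only; actual training would run batched over the embedding model's parameters, and the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce(query_emb, pos_emb, neg_embs, temperature=0.05):
    """InfoNCE-style contrastive loss for one (query, positive, hard
    negatives) triple. The positive is the partial table the synthetic
    query was generated from; negatives are hard (similar) wrong tables."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # similarity of the query to the positive and to each hard negative
    sims = np.array([cos(query_emb, pos_emb)] +
                    [cos(query_emb, n) for n in neg_embs]) / temperature
    sims -= sims.max()                       # numerical stability
    # negative log-softmax of the positive against all candidates
    return -(sims[0] - np.log(np.exp(sims).sum()))
```

Minimizing this loss over many LLM-generated (query, partial table) pairs is what refines the embedding model.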
Problem

Research questions and friction points this paper is trying to address.

table retrieval
semantic compression
query-table mismatch
structured data
embedding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

table retrieval
LLM-generated supervision
partial table construction
contrastive fine-tuning
semantic clustering
Tsung-Hsiang Chou
National Chung Hsing University, Smart Sustainable New Agriculture Research Center (SMARTer), Taichung, Taiwan
Chen-Jui Yu
National Chung Hsing University, Smart Sustainable New Agriculture Research Center (SMARTer), Taichung, Taiwan
Shui-Hsiang Hsu
National Chung Hsing University, Smart Sustainable New Agriculture Research Center (SMARTer), Taichung, Taiwan
Yao-Chung Fan
National Chung Hsing University, Taiwan
Natural Language Processing · Data Mining · Natural Language Generation