🤖 AI Summary
This work addresses the chart-driven data discovery problem: given a line chart query, how to retrieve tabular data from a large data warehouse that would generate a semantically similar line chart, especially when the original data is unavailable. To this end, we propose the Fine-grained Cross-modal Relevance Learning Model (FCM), the first model enabling semantic alignment between line charts and tabular data. FCM employs a visual element extractor to encode chart structure, segment-level encoders to represent charts and tables, and a fine-grained cross-modal matcher to align the two representations. We introduce the first dedicated benchmark for this problem, including support for aggregate-line-chart queries. Experiments demonstrate that FCM outperforms the strongest baseline by 30.1% in Prec@50 and 41.0% in NDCG@50, significantly advancing cross-modal chart-to-data retrieval research.
📝 Abstract
Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, the underlying dataset behind a line chart is often not readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts: using a line chart as a query to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach, the Fine-grained Cross-modal Relevance Learning Model (FCM), which estimates the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders learn representations for the line chart and the dataset that preserve fine-grained information, followed by a cross-modal matcher that matches the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated by data aggregation. Finally, since no benchmark exists for this problem, we propose one tailored to it. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our approach surpasses the best baseline by 30.1% and 41.0% in terms of Prec@50 and NDCG@50, respectively.
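For readers unfamiliar with the reported metrics, the sketch below shows the standard definitions of Prec@k and NDCG@k over a ranked retrieval list. This is a minimal illustration assuming binary or graded relevance labels; the paper does not specify its exact metric variant, so function names and inputs here are illustrative, not the authors' implementation.

```python
import math

def prec_at_k(relevances, k):
    """Precision@k: fraction of the top-k retrieved datasets that are relevant.

    `relevances` is the relevance score of each result in ranked order
    (e.g. 1 = relevant dataset, 0 = irrelevant), a hypothetical input format.
    """
    top = relevances[:k]
    return sum(1 for r in top if r > 0) / k

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: rewards relevant results ranked higher."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG normalized by the DCG of an ideally sorted ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, if only the first and third of four retrieved datasets are relevant (`[1, 0, 1, 0]`), then `prec_at_k(..., 2)` is 0.5, and NDCG@2 is below 1.0 because the ideal ranking would place both relevant datasets first.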