LINEAGEX: A Column Lineage Extraction System for SQL

📅 2025-05-29
🤖 AI Summary
This paper addresses automatic column-level data lineage extraction for enterprise data governance, where existing approaches either add overhead by integrating with query execution or fall short on accuracy. It presents LINEAGEX, a lightweight Python library that infers lineage statically by traversing SQL parse trees (abstract syntax trees), without executing any query. LINEAGEX reaches high coverage and accuracy by resolving ambiguous column references during traversal and by tracking context across statements. The system includes an interactive lineage graph visualization and is open-sourced on GitHub. The extracted lineage supports governance tasks such as data quality monitoring, storage refactoring, and workflow migration.

📝 Abstract
As enterprise data grows in size and complexity, column-level data lineage, which records the creation, transformation, and reference of each column in the warehouse, has become key to effective data governance, supporting tasks such as data quality monitoring, storage refactoring, and workflow migration. Unfortunately, existing systems either introduce overhead by integrating with query execution or fail to achieve satisfactory accuracy for column lineage. In this paper, we demonstrate LINEAGEX, a lightweight Python library that infers column-level lineage from SQL queries and visualizes it through an interactive interface. LINEAGEX achieves high coverage and accuracy for column lineage extraction by intelligently traversing query parse trees and handling ambiguities. The demonstration walks through use cases of building lineage graphs and troubleshooting data quality issues. LINEAGEX is open-sourced at https://github.com/sfu-db/lineagex, and a video demonstration is available at https://youtu.be/5LaBBDDitlw.
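
To make the idea of traversing a parse tree and handling ambiguities concrete, here is a minimal Python sketch. It is not LINEAGEX's implementation or API: the `ColumnRef`, `SelectItem`, and `Select` classes, the `resolve` and `column_lineage` functions, and the sample schemas are all hypothetical, and the parse tree is built by hand rather than produced by a real SQL parser. The sketch only shows how an unqualified column reference might be resolved against the tables in scope.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ColumnRef:
    """A column reference; `table` is None when the reference is unqualified."""
    name: str
    table: Optional[str] = None


@dataclass
class SelectItem:
    """One output column and the column references it is computed from."""
    alias: str
    sources: List[ColumnRef]


@dataclass
class Select:
    """A toy parse tree for: SELECT <items> FROM <tables>."""
    items: List[SelectItem]
    tables: Dict[str, str]  # alias used in the query -> real table name


def resolve(ref: ColumnRef, query: Select, schemas: Dict[str, Set[str]]) -> str:
    """Return the fully qualified 'table.column' that a reference points to."""
    if ref.table is not None:
        return f"{query.tables[ref.table]}.{ref.name}"
    # Unqualified reference: search every table in scope for the column.
    candidates = sorted(t for t in query.tables.values() if ref.name in schemas[t])
    if len(candidates) != 1:
        raise ValueError(f"cannot resolve {ref.name!r}; candidates: {candidates}")
    return f"{candidates[0]}.{ref.name}"


def column_lineage(query: Select, schemas: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each output column to the sorted source columns it depends on."""
    return {
        item.alias: sorted({resolve(ref, query, schemas) for ref in item.sources})
        for item in query.items
    }


# Hypothetical schemas and a hand-built tree for:
#   SELECT c.name AS customer, amount * 1.1 AS amount_with_tax
#   FROM orders o JOIN customers c ON o.customer_id = c.id
schemas = {
    "orders": {"id", "customer_id", "amount"},
    "customers": {"id", "name"},
}
query = Select(
    items=[
        SelectItem("customer", [ColumnRef("name", "c")]),
        SelectItem("amount_with_tax", [ColumnRef("amount")]),
    ],
    tables={"o": "orders", "c": "customers"},
)
lineage = column_lineage(query, schemas)
```

In this sketch the unqualified `amount` resolves to `orders.amount` because only one table in scope has that column, while an unqualified `id` would raise an error because both tables define it. A real extractor would also need to handle subqueries, CTEs, `SELECT *`, and dialect differences.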
Problem

Research questions and friction points this paper is trying to address.

Extracts column-level lineage from SQL queries
Improves accuracy and coverage in lineage extraction
Visualizes lineage via an interactive interface
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Python library for SQL lineage
Intelligently traverses query parse trees
Interactive visualization of column lineage
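
Once column-level lineage is available, it can be treated as a directed graph and walked in either direction, which is the basis of the troubleshooting use case mentioned in the abstract. The sketch below is an illustration under assumptions: the `LINEAGE` mapping, its column names, and the `upstream` and `downstream` helpers are hypothetical and do not reflect the data structures LINEAGEX actually uses.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: each output column maps to the columns it reads.
LINEAGE = {
    "staging.orders_clean.amount": ["raw.orders.amount"],
    "staging.orders_clean.customer_id": ["raw.orders.customer_id"],
    "mart.revenue.total": ["staging.orders_clean.amount"],
    "mart.revenue.customer": [
        "staging.orders_clean.customer_id",
        "raw.customers.name",
    ],
}


def upstream(column, lineage):
    """Return every column that `column` depends on, directly or transitively."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for source in lineage.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def downstream(column, lineage):
    """Return every column affected by `column`, directly or transitively."""
    # Reverse the edges, then reuse the same breadth-first walk.
    reverse = defaultdict(list)
    for target, sources in lineage.items():
        for source in sources:
            reverse[source].append(target)
    return upstream(column, reverse)


# Which inputs could explain a bad value in mart.revenue.total?
suspects = upstream("mart.revenue.total", LINEAGE)

# Which columns are affected if raw.orders.amount is corrupted?
affected = downstream("raw.orders.amount", LINEAGE)
```

Walking upstream from a suspicious column narrows a data quality issue to its possible sources, and walking downstream estimates the impact of changing or dropping a column.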
Authors

Shi Heng Zhang (Simon Fraser University, Burnaby, Canada)
Zhengjie Miao (Simon Fraser University)
Jiannan Wang (Simon Fraser University, Burnaby, Canada)