ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data engineering benchmarks evaluate isolated subtasks, such as SQL generation or tool invocation, and therefore fail to assess AI agents' capability to construct end-to-end ELT pipelines. To address this gap, we propose ELT-Bench, the first end-to-end ELT benchmark grounded in realistic cloud data warehouse scenarios. It comprises 100 cross-domain pipelines, 835 source tables, and 203 target data models, and comprehensively evaluates multi-source integration, tool invocation, SQL generation, and workflow orchestration. ELT-Bench simulates real-world data toolchains, requiring agents to interact with databases and execute multi-step workflows. We evaluate two code-agent frameworks, Spider-Agent and SWE-Agent, with six state-of-the-art LLMs, including Claude-3.7-Sonnet. Experimental results reveal that the best-performing agent correctly constructs only 3.9% of target data models, requiring on average 89.3 steps and incurring a cost of $4.30 per pipeline, highlighting substantial capability bottlenecks in complex data engineering tasks.

📝 Abstract
Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks such as text-to-SQL, present an opportunity to alleviate manual effort in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents on end-to-end ELT pipeline generation. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities to handle complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent with Claude-3.7-Sonnet and extended thinking, correctly generates only 3.9% of data models, at an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for more advanced AI agents to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ETL.git.
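For readers unfamiliar with the pattern, the ELT workflow the benchmark targets can be sketched in a few lines. This is a minimal, hypothetical illustration only, not the benchmark's actual setup: an in-memory SQLite database stands in for a cloud warehouse, and the table names (`raw_orders`, `customer_totals`) are invented for the example.

```python
import sqlite3

# Extract: raw records as they arrive from a source system.
raw_orders = [
    (1, "alice", 30.0),
    (2, "bob", 45.5),
    (3, "alice", 12.5),
]

conn = sqlite3.connect(":memory:")

# Load: land the data untransformed in a staging table
# (in ELT, transformation is deferred until after loading).
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_orders)

# Transform: build the target data model inside the warehouse with SQL,
# the kind of step ELT-Bench scores agents on.
conn.execute("""
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total_spent
    FROM raw_orders
    GROUP BY customer
""")

rows = conn.execute(
    "SELECT customer, total_spent FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 42.5), ('bob', 45.5)]
```

A real pipeline in the benchmark additionally involves multiple heterogeneous sources, dedicated loading tools, and orchestration across stages, which is what makes the end-to-end task hard for agents.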
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents for end-to-end ELT pipeline generation
Assessing AI capabilities in complex data engineering workflows
Addressing gaps in current benchmarks for ELT pipeline tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end benchmark for ELT pipelines
Simulates diverse data integration scenarios
Evaluates AI agents on complex workflows