Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis

📅 2025-11-10
🤖 AI Summary
Prior work lacks systematic evaluation of how data processing frameworks impact end-to-end deep learning training and inference—particularly regarding performance-energy trade-offs across data loading, preprocessing, and batch feeding stages in conjunction with GPU computation. Method: We conduct the first comprehensive empirical study comparing Pandas, Polars, and Dask across diverse deep learning workloads—including CNNs and Transformers trained on ImageNet and WikiText—measuring runtime, memory footprint, disk I/O, and CPU/GPU power consumption under varying data scales and I/O characteristics. Contribution/Results: Polars achieves optimal latency–energy efficiency for medium-scale in-memory datasets; Dask scales effectively to ultra-large distributed workloads but exhibits lower energy efficiency; Pandas remains practical for small-batch, interactive tasks. Our findings bridge a critical gap in co-optimizing data engineering infrastructure with AI training pipelines, providing empirical guidance for green AI system design and framework selection in production ML systems.
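The runtime and memory-footprint dimensions described above can be approximated for a single preprocessing stage with Python's standard library alone. The sketch below is illustrative, not the authors' harness: the in-memory CSV shard and the group-by-sum aggregation are hypothetical stand-ins for the paper's Pandas/Polars/Dask preprocessing steps.

```python
import csv
import io
import time
import tracemalloc

# Hypothetical in-memory CSV standing in for a dataset shard.
RAW = "label,value\n" + "\n".join(f"{i % 10},{i}" for i in range(100_000))

def preprocess(raw: str) -> dict:
    """Group-by-sum aggregation, a typical step before batch feeding."""
    sums: dict = {}
    for row in csv.DictReader(io.StringIO(raw)):
        sums[row["label"]] = sums.get(row["label"], 0) + int(row["value"])
    return sums

# Measure wall-clock runtime and peak Python-heap allocation for the stage.
tracemalloc.start()
t0 = time.perf_counter()
result = preprocess(RAW)
runtime = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"runtime: {runtime:.3f}s, peak memory: {peak / 1e6:.1f} MB, groups: {len(result)}")
```

A real study would additionally sample disk I/O and CPU/GPU power, which `tracemalloc` cannot see; this only illustrates the per-stage measurement pattern.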

📝 Abstract
This paper presents a detailed comparative analysis of the performance of three major Python data manipulation libraries - Pandas, Polars, and Dask - specifically when embedded within complete deep learning (DL) training and inference pipelines. The research bridges a gap in existing literature by studying how these libraries interact with substantial GPU workloads during critical phases like data loading, preprocessing, and batch feeding. The authors measured key performance indicators including runtime, memory usage, disk usage, and energy consumption (both CPU and GPU) across various machine learning models and datasets.
Problem

Research questions and friction points this paper is trying to address.

Comparing energy efficiency of dataframe libraries in deep learning pipelines
Analyzing library interactions with GPU workloads during data processing
Evaluating performance indicators like runtime and energy consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparing Pandas, Polars, Dask in deep learning
Analyzing energy usage during data loading phases
Measuring GPU and CPU consumption across models
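Power meters such as Intel's RAPL counters or `nvidia-smi` polling report discrete power samples (watts); total energy is the time integral of power. A minimal sketch of that integration step, assuming uniform polling and using hypothetical sample values (this is not the paper's measurement code):

```python
# Energy (joules) from discrete power samples (watts) via trapezoidal integration.
# Real harnesses would poll RAPL or `nvidia-smi` at a fixed interval; the trace
# below is made up for illustration.

def energy_joules(power_w: list[float], interval_s: float) -> float:
    """Trapezoidal integral of power over time: E = sum((P_i + P_{i+1})/2 * dt)."""
    if len(power_w) < 2:
        return 0.0
    return sum((a + b) / 2 * interval_s for a, b in zip(power_w, power_w[1:]))

# Hypothetical GPU power trace sampled once per second: idle, busy, idle.
trace = [60.0, 180.0, 175.0, 185.0, 70.0]
print(f"{energy_joules(trace, 1.0):.1f} J")  # → 605.0 J
```

Summing such per-stage integrals for CPU and GPU separately is what allows latency and energy to be traded off per library, as in the paper's comparison.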
Punit Kumar
Department of Computer Science and Engineering, University at Buffalo, New York, USA
A. Imran
Department of Computer Science and Engineering, University at Buffalo, New York, USA
Tevfik Kosar
Professor, University at Buffalo (SUNY)
Distributed systems · Green and sustainable computing · AI/ML for systems