🤖 AI Summary
Prior work lacks systematic evaluation of how data processing frameworks impact end-to-end deep learning training and inference—particularly regarding performance-energy trade-offs across data loading, preprocessing, and batch feeding stages in conjunction with GPU computation.
Method: We conduct the first comprehensive empirical study comparing Pandas, Polars, and Dask across diverse deep learning workloads—including CNNs and Transformers trained on ImageNet and WikiText—measuring runtime, memory footprint, disk I/O, and CPU/GPU power consumption under varying data scales and I/O characteristics.
Contribution/Results: Polars delivers the best latency–energy trade-off for medium-scale, in-memory datasets; Dask scales effectively to ultra-large distributed workloads but at lower energy efficiency; Pandas remains practical for small-batch, interactive tasks. Our findings bridge a critical gap in co-optimizing data engineering infrastructure with AI training pipelines, providing empirical guidance for green AI system design and framework selection in production ML systems.
📝 Abstract
This paper presents a detailed comparative analysis of the performance of three major Python data manipulation libraries (Pandas, Polars, and Dask) specifically when embedded within complete deep learning (DL) training and inference pipelines. The research bridges a gap in existing literature by studying how these libraries interact with substantial GPU workloads during critical phases like data loading, preprocessing, and batch feeding. The authors measured key performance indicators including runtime, memory usage, disk usage, and energy consumption (both CPU and GPU) across various machine learning models and datasets.
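To make the kind of per-stage measurement described above concrete, here is a minimal, illustrative sketch (not the paper's actual harness) of profiling the runtime and peak memory of a preprocessing stage. It uses Pandas plus the standard-library `time` and `tracemalloc` modules on synthetic data; the column names and the toy filter/normalize step are assumptions for the example.

```python
# Illustrative sketch: measure runtime and peak memory of one
# preprocessing stage, the kind of per-stage metric such a study reports.
import time
import tracemalloc

import pandas as pd


def profile_preprocess(df: pd.DataFrame):
    """Run a toy preprocessing step and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    # Example stage: filter rows, then z-score normalize a feature column.
    out = df[df["label"] >= 0].copy()
    out["feature"] = (out["feature"] - out["feature"].mean()) / out["feature"].std()
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return out, elapsed, peak


# Synthetic stand-in for a real dataset (hypothetical schema).
df = pd.DataFrame(
    {"feature": range(10_000), "label": [i % 3 - 1 for i in range(10_000)]}
)
result, seconds, peak_bytes = profile_preprocess(df)
print(f"rows kept: {len(result)}, runtime: {seconds:.4f}s, peak mem: {peak_bytes} B")
```

Equivalent pipelines would be written against Polars and Dask APIs and measured the same way; power draw, as studied in the paper, requires external tooling (e.g., RAPL counters for CPU or `nvidia-smi` for GPU) rather than stdlib profiling.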