🤖 AI Summary
In data lake queries, decoding Parquet files fetched from remote storage can account for up to 46% of query runtime. To address this bottleneck, this work proposes a cloud-oriented SmartNIC architecture that sits on the network data path of compute nodes and, for the first time, uses the SmartNIC to decode Parquet data and execute pushed-down query operators in hardware, delivering pre-filtered columnar data to the host. Integrated with DuckDB, the approach substantially reduces CPU resource consumption: experimental results indicate that, when operating on SmartNIC-pre-filtered data, a modest CPU configuration matches the query throughput of conventional systems that rely on significantly more powerful host processors.
📝 Abstract
Data lakes spend a significant fraction of query execution time on scanning data from remote storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files. Our experimental estimates with DuckDB suggest that by operating directly on pre-filtered data as delivered by a SmartNIC, significantly smaller CPUs can still match the query throughput of traditional setups.
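The pre-filtering the abstract describes can be sketched in miniature. The following standalone Python (an illustrative toy, not the paper's implementation; all names are hypothetical) mimics what a datapath offload would do for a pushed-down `value < threshold` filter over Parquet-style row groups: skip whole groups whose min statistic rules out any match, then filter the surviving rows, so only pre-filtered data reaches the host CPU.

```python
# Toy model of filter pushdown over columnar row groups (illustrative only).
def prefilter(row_groups, threshold):
    """Keep rows with value < threshold; skip any row group whose minimum
    value already fails the predicate (zone-map-style pruning)."""
    out = []
    for rows in row_groups:
        if min(rows) >= threshold:
            # No row in this group can match: prune it without decoding rows.
            continue
        out.extend(v for v in rows if v < threshold)
    return out

groups = [[1, 5, 9], [12, 15, 20], [3, 7, 11]]
print(prefilter(groups, 10))  # → [1, 5, 9, 3, 7]
```

In the envisioned system this pruning and filtering would run on the SmartNIC as data streams from storage, so the host query engine (here, DuckDB) only ever materializes the surviving rows.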