Should I Hide My Duck in the Lake?

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In data lake queries, decoding Parquet files fetched from remote storage can account for 46% of query runtime (measured on TPC-H). To address this, the paper proposes a cloud-oriented SmartNIC architecture that, for the first time, places Parquet decoding and pushed-down query operators directly on the network data path, so the host receives pre-filtered columnar data instead of raw files. Integrated with DuckDB, the approach substantially reduces CPU resource consumption: experiments suggest that, operating on SmartNIC pre-filtered data, a modest CPU configuration matches the query throughput of conventional systems running on significantly more powerful host processors.

📝 Abstract
Data lakes spend a significant fraction of query execution time on scanning data from remote storage. Decoding alone accounts for 46% of runtime when running TPC-H directly on Parquet files. To address this bottleneck, we propose a vision for a data processing SmartNIC for the cloud that sits on the network datapath of compute nodes to offload decoding and pushed-down operators, effectively hiding the cost of querying raw files. Our experimental estimations with DuckDB suggest that by operating directly on pre-filtered data as delivered by a SmartNIC, significantly smaller CPUs can still match query throughput of traditional setups.
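The abstract's core idea — a SmartNIC pre-filtering row groups on the network path so the host only decodes data that can match the query — can be illustrated with a toy, stdlib-only sketch. This is not the paper's implementation: `RowGroup`, `nic_prefilter`, and `host_decode_and_filter` are hypothetical names, and min/max statistics plus a plain list stand in for real Parquet encoding.

```python
# Illustrative model of SmartNIC predicate pushdown over columnar data.
# The "NIC" stage prunes row groups using min/max statistics; only the
# survivors reach the host, which performs the (expensive) decode step.

from dataclasses import dataclass

@dataclass
class RowGroup:
    min_val: int
    max_val: int
    encoded_rows: list  # stand-in for an encoded Parquet column chunk

def nic_prefilter(row_groups, lo, hi):
    """Hypothetical SmartNIC stage: drop row groups whose min/max
    statistics prove no row can satisfy lo <= value <= hi."""
    return [rg for rg in row_groups if rg.max_val >= lo and rg.min_val <= hi]

def host_decode_and_filter(row_groups, lo, hi):
    """Host stage: decode the surviving row groups and apply the exact
    predicate. In practice decoding dominates this stage's cost."""
    out = []
    for rg in row_groups:
        for v in rg.encoded_rows:  # "decode" each value
            if lo <= v <= hi:
                out.append(v)
    return out

groups = [RowGroup(0, 9, list(range(0, 10))),
          RowGroup(10, 19, list(range(10, 20))),
          RowGroup(20, 29, list(range(20, 30)))]

survivors = nic_prefilter(groups, 12, 15)
print(len(survivors))                              # 1 of 3 row groups reaches the CPU
print(host_decode_and_filter(survivors, 12, 15))   # [12, 13, 14, 15]
```

The point of the sketch is the ratio: the host decodes one row group instead of three, which is how a smaller CPU can keep up with a larger one on the same query.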
Problem

Research questions and friction points this paper is trying to address.

data lake
query execution
decoding overhead
remote storage
performance bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

SmartNIC
data lake
query offloading
Parquet decoding
cloud data processing