Analysis of Server Throughput For Managed Big Data Analytics Frameworks

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Managed big-data frameworks (e.g., Spark, Giraph) suffer high garbage collection (GC) overhead under Java heap memory pressure, while offloading objects to external storage incurs substantial serialization/deserialization (S/D) cost; merely scaling DRAM leaves CPU cores underutilized. Method: The paper analyzes how GC and S/D overheads jointly limit server throughput and applies TeraHeap, a two-tier heap that places a secondary heap (H2) on a fast storage device, while dividing each instance's DRAM budget between the Java heap (H1) and the page cache (PC), evaluated under co-location of multiple memory-bound instances. Contribution/Results: Across diverse memory-core configurations, TeraHeap improves effective CPU utilization by up to 2.3× over baseline approaches, significantly boosting server throughput, and establishes a lightweight, system-level memory-coordination approach for resource-constrained big-data execution environments.
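The DRAM budgeting described above can be sketched as a simple calculation: split the server's DRAM across co-located instances, then divide each instance's share between H1 and PC. The function name and the split fractions (80/20) below are illustrative assumptions, not the paper's tuned values:

```python
# Illustrative sketch of per-instance DRAM budgeting for co-located
# JVM instances, splitting each budget between the managed heap (H1)
# and the page cache (PC). Ratios are assumed for illustration only.

def dram_budgets(total_dram_gb, num_instances, h1_fraction):
    """Split server DRAM evenly across instances, then divide each
    instance's budget between H1 and PC by h1_fraction."""
    per_instance = total_dram_gb / num_instances
    h1 = per_instance * h1_fraction
    pc = per_instance - h1
    return per_instance, h1, pc

# Two distributions, mirroring the paper's methodology: one favoring H1,
# one favoring PC (the exact fractions here are assumptions).
for label, frac in [("more-H1", 0.8), ("more-PC", 0.2)]:
    per_inst, h1, pc = dram_budgets(total_dram_gb=256, num_instances=4,
                                    h1_fraction=frac)
    print(f"{label}: {per_inst:.0f} GB/instance -> H1={h1:.0f} GB, PC={pc:.0f} GB")
```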

📝 Abstract
Managed big data frameworks, such as Apache Spark and Giraph, demand a large amount of memory per core to process massive datasets effectively. The memory pressure that arises from big data processing leads to high garbage collection (GC) overhead. Big data analytics frameworks attempt to remove this overhead by offloading objects to storage devices. At the same time, infrastructure providers, trying to address the same problem, provision more memory per instance, leaving cores underutilized. For frameworks, avoiding GC by offloading to storage devices leads to high serialization/deserialization (S/D) overhead; for infrastructure, the result is decreased resource usage. These limitations prevent managed big data frameworks from effectively utilizing the CPU, leading to low server throughput. We conduct a methodological analysis of server throughput for managed big data analytics frameworks. More specifically, we examine whether reducing GC and S/D can increase the effective CPU utilization of the server. We use a system called TeraHeap that moves objects from the Java managed heap (H1) to a secondary heap over a fast storage device (H2) to reduce GC overhead and eliminate S/D over data. We analyze the system's performance under co-location of multiple memory-bound instances that together utilize all available DRAM, and study server throughput. Our detailed methodology includes choosing the DRAM budget for each instance and distributing this budget between H1 and the page cache (PC). We try two different distributions of the DRAM budget, one favoring H1 and one favoring PC, to study the needs of both approaches. We evaluate both techniques under three different memory-per-core scenarios, using Spark and Giraph with either the native JVM or a JVM with TeraHeap, to measure how throughput changes as memory capacity increases.
Problem

Research questions and friction points this paper is trying to address.

Reducing garbage collection overhead in big data frameworks
Minimizing serialization/deserialization costs for offloaded objects
Optimizing CPU utilization under memory-bound instance co-location
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses TeraHeap to reduce GC overhead
Moves objects to secondary heap H2
Optimizes DRAM budget distribution
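As a rough intuition for the two-tier idea listed above, the toy sketch below migrates long-lived objects out of the GC-scanned tier into a second tier, by loose analogy with TeraHeap's H1/H2 split. It is a conceptual illustration only, not TeraHeap's actual mechanism (TeraHeap keeps H2 objects in heap layout on a fast storage device, which is what eliminates S/D); the class and its promotion threshold are invented for this sketch:

```python
# Toy two-tier heap: tier 1 (H1) is scanned on every GC; objects that
# survive enough collections migrate to tier 2 (H2), shrinking the work
# each subsequent GC must do. Conceptual only; not TeraHeap's implementation.

class TwoTierHeap:
    def __init__(self, promote_after=2):
        self.h1 = {}            # young/hot objects, scanned on every GC
        self.h2 = {}            # long-lived objects, excluded from GC scans
        self.survivals = {}     # GC-survival count per H1 object
        self.promote_after = promote_after

    def allocate(self, key, obj):
        self.h1[key] = obj
        self.survivals[key] = 0

    def gc(self, live_keys):
        """Scan only H1: drop dead objects, promote long-survivors to H2."""
        for key in list(self.h1):
            if key not in live_keys:
                del self.h1[key], self.survivals[key]
                continue
            self.survivals[key] += 1
            if self.survivals[key] >= self.promote_after:
                self.h2[key] = self.h1.pop(key)
                del self.survivals[key]
        return len(self.h1)     # objects the next GC must still scan

heap = TwoTierHeap()
for i in range(4):
    heap.allocate(i, f"obj-{i}")
heap.gc(live_keys={0, 1, 2})    # object 3 dies; survivors stay in H1
heap.gc(live_keys={0, 1, 2})    # survivors promoted to H2; H1 scan shrinks
```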
Emmanouil Anagnostakis
Institute of Computer Science (ICS), Foundation for Research and Technology – Hellas (FORTH), and Computer Science Department, University of Crete, Greece
Polyvios Pratikakis
Computer Science Department, University of Crete, FORTH
Runtime Systems · Programming Languages · Concurrency · Static Analysis · Big Data Analytics