🤖 AI Summary
This work addresses the performance bottleneck that data skew causes in distributed hash joins over wide-area networks, where it unbalances both computation and communication loads. The authors propose Bala-Join, an adaptive approach that balances computation and network load during distributed hash join execution in geo-distributed SQL databases. The core contributions are the Balanced Partition and Partial Replication (BPPR) algorithm, which redistributes skewed data through a multicast mechanism; a distributed online skewed join key detector, which supplies BPPR with real-time skew information; and the Active-Signaling and Asynchronous-Pulling (ASAP) mechanism, which synchronizes the detector with the redistribution process at minimal overhead. The empirical study shows that Bala-Join improves throughput by 25%–61% over popular distributed hash join solutions.
📝 Abstract
Shared-nothing geo-distributed SQL databases, such as CockroachDB, are increasingly vital for enterprise applications requiring data resilience and locality. However, we observed significant performance degradation in customer deployments, especially those spanning multiple data centers over a Wide Area Network (WAN). Our investigation traces the bottleneck to the Distributed Hash Join (Dist-HJ) algorithm, whose performance depends on a balance between communication overhead and computational load. Skewed data from real-world customer workloads severely disrupts this balance, causing the observed performance decline. To tackle this challenge, we introduce Bala-Join, an adaptive solution that balances the computation and network load in Dist-HJ execution. Our approach consists of the Balanced Partition and Partial Replication (BPPR) algorithm and a distributed online skewed join key detector. The former achieves balanced redistribution of skewed data through a multicast mechanism, improving computational performance and reducing network overhead. The latter provides real-time skewed join key information tailored to BPPR. Furthermore, an Active-Signaling and Asynchronous-Pulling (ASAP) mechanism enables efficient, real-time synchronization between the detector and the redistribution process with minimal overhead. Our empirical study shows that Bala-Join outperforms popular Dist-HJ solutions, increasing throughput by 25%–61%.
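To make the idea of skew-aware redistribution concrete, the following is a minimal single-process sketch in Python. It is not the authors' BPPR algorithm or detector, whose details the abstract does not give. It substitutes two well-known generic techniques: a Misra-Gries heavy-hitter summary stands in for the skewed join key detector, and a "spread the skewed probe rows, replicate the matching build rows" rule stands in for balanced partitioning with replication. All function names (`misra_gries`, `home_node`, `route`, `local_join`) are hypothetical, and replicating to every node is a simplification of partial replication.

```python
import zlib


def misra_gries(keys, k):
    """Generic Misra-Gries summary (a stand-in, not the paper's detector).

    Keeps at most k - 1 counters; any key occurring more than n/k times
    in a stream of n keys is guaranteed to be in the returned set.
    The set may also contain a few non-frequent keys.
    """
    counters = {}
    for key in keys:
        if key in counters:
            counters[key] += 1
        elif len(counters) < k - 1:
            counters[key] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for c in list(counters):
                counters[c] -= 1
                if counters[c] == 0:
                    del counters[c]
    return set(counters)


def home_node(key, num_nodes):
    """Deterministic hash partitioning, as in a plain distributed hash join."""
    return zlib.crc32(str(key).encode()) % num_nodes


def route(build_rows, probe_rows, skewed_keys, num_nodes):
    """Assign (key, payload) rows to nodes.

    Non-skewed keys: both sides go to the key's home node.
    Skewed keys: probe rows are spread round-robin, and matching build
    rows are copied to every node (assumption: full replication).
    """
    build = [[] for _ in range(num_nodes)]
    probe = [[] for _ in range(num_nodes)]
    next_node = 0
    for key, payload in probe_rows:
        if key in skewed_keys:
            probe[next_node].append((key, payload))
            next_node = (next_node + 1) % num_nodes
        else:
            probe[home_node(key, num_nodes)].append((key, payload))
    for key, payload in build_rows:
        if key in skewed_keys:
            for node in range(num_nodes):
                build[node].append((key, payload))
        else:
            build[home_node(key, num_nodes)].append((key, payload))
    return build, probe


def local_join(build, probe):
    """Ordinary in-memory hash join executed on one node."""
    table = {}
    for key, payload in build:
        table.setdefault(key, []).append(payload)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]
```

Because each skewed probe row lands on exactly one node and every node holds the matching build rows, the union of the per-node `local_join` results equals the full join result, while no single node receives all rows for a hot key. The cost is the extra copies of the replicated build rows, which is the communication-versus-computation trade-off the abstract describes.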