SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models

📅 2025-02-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
High inter-GPU communication overhead in tensor-parallel inference of large language models (LLMs) limits scalability and increases latency. Method: This paper proposes Sync-Point Drop (SPD), an optimization that selectively skips the synchronization (all-reduce) on attention outputs. SPD first introduces a block design that allows execution to proceed without communication, then applies different SPD strategies to attention blocks based on their sensitivity to model accuracy. Contribution/Results: On an 8-GPU configuration, SPD reduces end-to-end inference latency by about 20% for LLaMA2-70B with less than 1% accuracy regression, effectively alleviating the communication bottleneck in distributed LLM inference.

πŸ“ Abstract
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with <1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
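To make the core idea concrete, here is a minimal sketch of the synchronization point SPD targets. In Megatron-style tensor parallelism, each rank holds a partial sum of the attention output projection and an all-reduce combines them; SPD skips that all-reduce for low-sensitivity blocks. This simulation uses plain Python lists in place of GPUs and collectives; all function names are illustrative, not from the paper's code.

```python
def all_reduce_sum(partials):
    # Simulated all-reduce: every rank receives the elementwise sum
    # of all ranks' partial outputs.
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

def attention_output(partials, drop_sync):
    # Row-parallel attention output projection: each rank computes a
    # partial output. Normally the sync point (all-reduce) combines
    # them; under SPD, a low-sensitivity block skips the sync and each
    # rank proceeds with its local partial output (no communication).
    if drop_sync:
        return partials
    return all_reduce_sum(partials)

# Toy example: two "ranks", each with a partial attention output.
rank_partials = [[1.0, 2.0], [3.0, 4.0]]

synced = attention_output(rank_partials, drop_sync=False)   # full sync
dropped = attention_output(rank_partials, drop_sync=True)   # SPD
```

With `drop_sync=False` both ranks end up with the identical summed output; with `drop_sync=True` the inter-GPU transfer is avoided entirely, which is where the latency savings come from (the paper's block design handles running on the uncombined partials).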
Problem

Research questions and friction points this paper is trying to address.

High inter-GPU communication overhead in tensor-parallel LLM inference
All-reduce synchronization on attention outputs adds latency
Poor scalability of distributed inference as model and GPU counts grow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sync-Point Drop reduces tensor parallelism communication overheads.
Block design enables execution without communication via SPD.
SPD strategies adapt to attention block sensitivity for accuracy.
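The second innovation above implies a selection step: decide which attention blocks can tolerate a dropped sync point. A minimal sketch, assuming per-block sensitivity scores have already been measured (how the paper measures them is not detailed in this summary; the helper name and the fixed drop budget are illustrative assumptions):

```python
def select_drop_blocks(sensitivity, num_drops):
    # Pick the num_drops attention blocks with the lowest sensitivity
    # to model accuracy; these are the candidates for sync-point drop.
    # sensitivity: dict mapping block index -> sensitivity score.
    ranked = sorted(sensitivity, key=sensitivity.get)
    return set(ranked[:num_drops])

# Toy sensitivity profile over four attention blocks.
scores = {0: 0.90, 1: 0.10, 2: 0.40, 3: 0.05}
drop_set = select_drop_blocks(scores, num_drops=2)  # {1, 3}
```

Blocks outside `drop_set` keep the full all-reduce, which is how accuracy degradation stays small while the communication volume shrinks.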