MoLink: Distributed and Efficient Serving Framework for Large Models

📅 2025-07-07
🤖 AI Summary
Serving large language models (LLMs) on consumer-grade GPUs in low-bandwidth, heterogeneous host environments remains challenging due to communication bottlenecks and inefficient resource utilization. Method: This paper proposes a lightweight distributed inference framework featuring tensor parallelism coordinated over public networks and Ethernet, dynamic load balancing, heterogeneous memory-aware scheduling, and a low-overhead communication protocol, enabling the first multi-GPU aggregated inference under weak-network conditions. Contribution/Results: The framework supports 18 mainstream open-source LLM architectures and offers plug-and-play deployment across Windows, Linux, and containerized VMs. Experiments demonstrate up to 458% higher throughput and up to 151% higher profit margin per unit of computational cost compared with state-of-the-art systems, significantly lowering both the technical barrier and the operational cost of LLM serving.

📝 Abstract
Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative, presenting an opportunity for more cost-efficient LLM serving. However, achieving high-efficiency LLM serving on consumer-grade GPUs is non-trivial, mainly due to two challenges: 1) these GPUs are often deployed under limited network conditions; 2) their host systems are often heterogeneous. To address these challenges, we present MoLink, a distributed serving system for large language models. It incorporates several key techniques that enable efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458% and cost-profit margin improvements of up to 151% compared with state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to seamlessly integrate GPUs with just a few lines of code over Ethernet or public networks. Currently, it supports 18 mainstream architectures of open-source large language models.
Problem

Research questions and friction points this paper is trying to address.

High cost of serving large language models efficiently
Challenges in using consumer-grade GPUs for LLM serving
Network limitations and system heterogeneity in GPU deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed serving for large models
Efficient use of consumer-grade GPUs
Handles heterogeneous and weak networks
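The heterogeneous memory-aware scheduling idea above can be illustrated by partitioning a model's layers across GPUs in proportion to each device's free memory. This is a minimal sketch under stated assumptions, not MoLink's actual algorithm: the function name and the largest-remainder rounding scheme are illustrative choices.

```python
def partition_layers(num_layers, free_mem_gb):
    """Split `num_layers` transformer layers across GPUs in proportion
    to each GPU's free memory (illustrative sketch, not MoLink's API).
    Uses largest-remainder rounding so counts always sum to `num_layers`."""
    total = sum(free_mem_gb)
    shares = [num_layers * m / total for m in free_mem_gb]
    counts = [int(s) for s in shares]  # floor each proportional share
    leftover = num_layers - sum(counts)
    # hand the remaining layers to the GPUs with the largest fractional share
    by_frac = sorted(range(len(shares)),
                     key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_frac[:leftover]:
        counts[i] += 1
    return counts

# e.g. a 40-layer model on a 24 GB, an 11 GB, and an 8 GB GPU
print(partition_layers(40, [24, 11, 8]))  # → [22, 10, 8]
```

A real scheduler would also account for KV-cache growth and activation memory, not just a static per-device capacity, but the proportional split captures the core idea of memory-aware placement on heterogeneous hardware.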
Lewei Jin
Zhejiang University
Yongqi Chen
Unknown affiliation
Generative models · Machine Learning Systems · Robotics
Kui Zhang
Zhejiang University
Yifan Zhuo
Zhejiang University
Yi Gao
Zhejiang University
Bowei Yang
Zhejiang University
Zhengong Cai
Zhejiang University
Wei Dong
Zhejiang University