🤖 AI Summary
This work addresses the low GPU utilization and device idling caused by communication latency and data dependencies when parameter-efficient fine-tuning (PEFT) tasks are deployed as independent instances in multi-tenant data centers. To overcome these challenges, the authors propose a modular backbone-sharing mechanism based on a unified PEFT representation, enabling efficient concurrent execution of multiple tasks through joint spatial and temporal reuse of the shared backbone. The approach integrates coordinated scheduling across the task, operator, and data levels, combining hierarchical scheduling, hybrid spatial-temporal multiplexing, two-tiered hybrid parallelism, and chunk-based data alignment to mitigate inefficiencies from ineffective tokens. Experimental results demonstrate that the proposed system achieves up to a 2.33× improvement in throughput and reduces memory consumption by up to 5.29× compared to three state-of-the-art baselines.
📝 Abstract
Parameter-Efficient Fine-Tuning (PEFT) is widely applied as the backend of fine-tuning APIs for large language model (LLM) customization in datacenters. Service providers deploy separate instances for individual PEFT tasks, giving rise to prominent resource inefficiencies, including (1) GPU underutilization from small-scale, PEFT-native operators and (2) device stalls from communication delays and data dependencies in parallelized execution. To address these issues, this paper presents MuxTune, a fine-tuning system that enables resource-efficient concurrent execution of multiple PEFT tasks. The key idea is to multiplex the backbone across independent tasks in a spatial-temporal manner for improved utilization and reduced stalls. Building on flexible, modularized backbone sharing via unified PEFT representations, MuxTune proposes a hierarchical co-scheduling scheme with task-, operator-, and data-level optimizations. Specifically, it fuses tasks through a hybrid of spatial and temporal multiplexing, and orchestrates multi-task operator execution with two-tiered hybrid parallelism. Additionally, MuxTune employs chunk-based data alignment to mitigate inter-task ineffective tokens. Experimental results demonstrate that MuxTune achieves up to $2.33\times$ higher throughput and $5.29\times$ memory reduction compared to three state-of-the-art baselines.
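To make the core idea concrete, here is a minimal NumPy sketch of spatial backbone multiplexing for LoRA-style PEFT: inputs from several tenant tasks are fused into one batch, the frozen shared backbone weight is applied once (one large GEMM instead of many small, underutilizing ones), and each task's low-rank adapter is then applied only to its own slice. All names (`adapters`, `multiplexed_forward`, the task labels) and the single-layer setup are illustrative assumptions, not MuxTune's actual implementation or its unified PEFT representation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, R = 16, 4                          # hidden size, LoRA rank (illustrative)
W = rng.standard_normal((D, D))       # frozen backbone weight, shared by all tasks

# Per-task LoRA adapters (A: D x R, B: R x D); B starts at zero, the standard LoRA init,
# so each task's update is initially a no-op and only diverges as it trains.
adapters = {t: (0.1 * rng.standard_normal((D, R)), np.zeros((R, D)))
            for t in ("task0", "task1", "task2")}

def multiplexed_forward(batches):
    """Fuse per-task batches, run the shared backbone once (spatial multiplexing),
    then add each task's low-rank delta to its own slice of the fused output."""
    names = list(batches)
    sizes = [batches[n].shape[0] for n in names]
    fused = np.concatenate([batches[n] for n in names])   # one fused multi-task batch
    base = fused @ W                                      # single large backbone GEMM
    outs, start = {}, 0
    for n, s in zip(names, sizes):
        A, B = adapters[n]
        x = fused[start:start + s]
        outs[n] = base[start:start + s] + x @ A @ B       # per-task LoRA correction
        start += s
    return outs

batches = {n: rng.standard_normal((b, D))
           for n, b in [("task0", 2), ("task1", 3), ("task2", 1)]}
outs = multiplexed_forward(batches)
```

In this sketch only the small `A @ B` products are task-specific; a real system would additionally interleave tasks in time (temporal multiplexing) and align uneven batch sizes, which is roughly what the paper's chunk-based data alignment addresses.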