🤖 AI Summary
This paper identifies cross-client domain coverage, rather than data heterogeneity, as the dominant factor governing performance in Federated Domain-specific Instruction Tuning (FedDIT), addressing limited domain coverage, performance bottlenecks, and privacy risks when instruction-tuning large language models (LLMs) under federated learning. The authors propose FedDCA, which optimizes distributed domain coverage via greedy client center selection and retrieval-augmented generation (RAG)-based instruction augmentation on server-side public data, and its lightweight variant FedDCA$^*$, which uses heterogeneous encoders with server-side feature alignment to reduce client-side computation. Experiments across the code, medical, financial, and mathematical domains demonstrate significant performance gains. An analysis of memory extraction attacks further shows that privacy leakage risk decreases or converges as fine-tuning rounds increase, and that privacy-preserving capability shows no significant correlation with the amount of public data used.
📝 Abstract
Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited cross-client private data together with server-side public data for instruction augmentation, ultimately boosting model performance within specific domains. To date, the factors affecting FedDIT remain unclear, and existing instruction augmentation methods primarily focus on the centralized setting without considering distributed environments. Our experiments reveal that cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. In response, we propose FedDCA, which optimizes domain coverage through greedy client center selection and retrieval-based augmentation. For client-side computational efficiency and system scalability, FedDCA$^*$, a variant of FedDCA, utilizes heterogeneous encoders with server-side feature alignment. Extensive experiments across four distinct domains (code, medical, financial, and mathematical) substantiate the effectiveness of both methods. Additionally, we investigate privacy preservation against memory extraction attacks utilizing various amounts of public data. Results show no significant correlation between the volume of public data and privacy-preserving capability. However, as the fine-tuning rounds increase, the risk of privacy leakage decreases or converges.
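The abstract does not spell out the greedy client center selection step. As a hedged illustration only (the paper's actual objective and distance measure may differ), a generic greedy coverage-maximization over instruction embeddings, in the style of farthest-point / k-center selection, could look like:

```python
import numpy as np

def greedy_center_selection(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k centers that spread out over the embedding space,
    a common proxy for maximizing coverage. This is a generic sketch,
    NOT the exact FedDCA algorithm.

    embeddings: (n, d) array of instruction embeddings.
    Returns indices of the selected centers.
    """
    # Seed with the point nearest the global mean.
    mean = embeddings.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(embeddings - mean, axis=1)))
    centers = [first]

    # dist[i] = distance from point i to its nearest selected center.
    dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    for _ in range(k - 1):
        # Greedy step: take the point currently worst-covered.
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(
            dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        )
    return centers
```

Each selected center could then seed a retrieval query against the server-side public dataset to augment that client's instructions, per the retrieval-based augmentation the abstract describes.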