Harnessing Source Heterogeneity for Cluster-Structured Transfer Learning

📅 2026-06-03
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of transferring knowledge from heterogeneous multi-source data when target data are scarce and the source domains fall into distinct clusters, a scenario where conventional transfer learning methods struggle. The authors propose Trans-GLMC, an approach that explicitly models source-domain heterogeneity through latent cluster structures, using coefficient-based distances to identify which sources resemble the target. By combining global fusion, within-cluster refinement, and target-specific debiasing, Trans-GLMC enables structure-aware transfer for generalized linear models. The framework moves beyond the traditional binary informative/non-informative source selection paradigm, allowing subgroups of sources to be used according to their transferability. Evaluated on a suicide-risk prediction task using CHIME data, the method improves hospital-level predictive performance, identifies interpretable communities of hospitals with mutual transferability, and recovers risk factors consistent with clinical knowledge.
📝 Abstract
Transfer learning is a natural strategy when a target population has limited data but multiple related auxiliary sources are available. A central difficulty is source heterogeneity: auxiliary sources may not be equally useful, and their usefulness may vary in a structured, cluster-like fashion. Existing transfer-learning methods often reduce source selection to a binary informative/non-informative decision, overlooking subgroups of sources with differential transferability. Motivated by a suicide-risk study using data from the Connecticut Hospital Information Management Exchange (CHIME), comprising 636,758 patients across 27 hospitals, we propose Trans-GLMC, a cluster-structured transfer-learning procedure for generalized linear models. The CHIME setting illustrates the core challenge: hospital-specific risk models are unstable because suicide attempts are rare at any single facility, whereas indiscriminate pooling across hospitals can obscure facility-level differences in patient mix and risk profiles. Trans-GLMC first constructs a coefficient-based distance among the target and candidate sources to recover latent source clusters. It then combines global fusion, within-cluster refinement, and target debiasing to produce an estimator that adapts to the detected structure. We establish a non-asymptotic error bound that improves over its unclustered counterpart whenever a meaningful target cluster exists and matches the unclustered rate up to constants otherwise. In simulations and in the CHIME study, Trans-GLMC improves facility-specific prediction, identifies interpretable communities of hospitals with mutual transferability, and recovers clinically coherent suicide-risk factors.
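To make the pipeline in the abstract concrete, here is a minimal Python sketch of the general idea. It is an illustration under stated assumptions, not the authors' Trans-GLMC implementation. The abstract does not specify the distance, the clustering algorithm, or the penalties, so the sketch assumes a Euclidean distance between per-site logistic-regression coefficients, a simple distance threshold in place of a clustering algorithm, and ridge penalties throughout. It also omits the global fusion step. All function names and parameters (`fit_logistic`, `cluster_transfer`, `threshold`, `ridge`) are hypothetical, and the data are simulated.

```python
import numpy as np

def fit_logistic(X, y, offset=None, ridge=1e-3, n_iter=25):
    """Ridge-penalized logistic regression (no intercept) via Newton's method."""
    n, p = X.shape
    offset = np.zeros(n) if offset is None else offset
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = np.clip(X @ beta + offset, -30, 30)   # guard against overflow
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        grad = X.T @ (y - mu) - ridge * n * beta
        hess = X.T @ (X * w[:, None]) + ridge * n * np.eye(p)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def coefficient_distances(betas):
    """Pairwise Euclidean distances between per-site coefficient vectors."""
    diff = betas[:, None, :] - betas[None, :, :]
    return np.linalg.norm(diff, axis=2)

def cluster_transfer(sites, target=0, threshold=1.5):
    """Assumed simplification: threshold the distance to the target,
    pool the selected sources with the target, then debias on target data."""
    betas = np.vstack([fit_logistic(X, y) for X, y in sites])
    dist = coefficient_distances(betas)
    members = [k for k in range(len(sites))
               if k != target and dist[target, k] <= threshold]
    X_t, y_t = sites[target]
    if not members:                      # no usable cluster: target-only fit
        return betas[target], members
    X_pool = np.vstack([X_t] + [sites[k][0] for k in members])
    y_pool = np.concatenate([y_t] + [sites[k][1] for k in members])
    beta_pool = fit_logistic(X_pool, y_pool)
    # Debiasing: fit a heavily shrunk correction on target data,
    # using the pooled linear predictor as an offset.
    delta = fit_logistic(X_t, y_t, offset=X_t @ beta_pool, ridge=1.0)
    return beta_pool + delta, members

def simulate_site(rng, beta, n):
    """Simulated site with standard-normal covariates and binary outcomes."""
    X = rng.normal(size=(n, beta.size))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, prob).astype(float)

rng = np.random.default_rng(0)
beta_a = np.array([1.0, -1.0, 0.5, 0.0, 0.0])    # target and similar sources
beta_b = np.array([-1.0, 1.0, -0.5, 1.0, -1.0])  # dissimilar sources
sites = [simulate_site(rng, beta_a, 200)]                      # small target
sites += [simulate_site(rng, beta_a, 1000) for _ in range(3)]  # sources 1-3
sites += [simulate_site(rng, beta_b, 1000) for _ in range(3)]  # sources 4-6

beta_hat, members = cluster_transfer(sites, target=0)
beta_target_only = fit_logistic(*sites[0])
print("sources selected for the target:", members)
print("target-only error:", np.linalg.norm(beta_target_only - beta_a))
print("transfer error:   ", np.linalg.norm(beta_hat - beta_a))
```

In this toy setup the threshold plays the role of the cluster-recovery step: sources whose coefficients sit far from the target's are excluded rather than pooled indiscriminately. When no source falls within the threshold, the sketch falls back to the target-only fit, loosely mirroring the abstract's statement that the method matches the unclustered rate when no meaningful target cluster exists.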
Problem

Research questions and friction points this paper is trying to address.

transfer learning
source heterogeneity
cluster structure
generalized linear models
data pooling
Innovation

Methods, ideas, or system contributions that make the work stand out.

cluster-structured transfer learning
source heterogeneity
Trans-GLMC
generalized linear models
coefficient-based clustering