HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Automated extraction of pharmacokinetic (PK) tables from the literature faces significant challenges—including data heterogeneity, high noise levels, and substantial domain shift—undermining the reliability of computational pharmacology models. To address this, we propose a theory-driven large language model (LLM) adaptation framework that jointly integrates embedding-space similarity-weighted fine-tuning and manifold-aware denoising. Our method incorporates theoretically grounded constraints into the loss function, derived from generalization bounds and manifold regularization. This approach substantially enhances LLM robustness in parsing noisy PK tables, achieving an average accuracy improvement of 12.7% across multiple cross-domain PK benchmarks. Moreover, the framework provides inherent interpretability through its principled design. To our knowledge, this is the first work to establish a domain-adaptive theoretical framework for biomedical structured information extraction that simultaneously satisfies mathematical rigor and practical applicability.

Technology Category

Application Category

📝 Abstract
The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
Problem

Research questions and friction points this paper is trying to address.

Extracting pharmacokinetic data from scientific literature remains challenging
Adapting LLMs to biomedical data faces heterogeneity and noise issues
Enhancing LLM robustness for structured biomedical information extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embedding-weighted fine-tuning enhances LLM adaptation
Manifold denoising reduces noise in biomedical data
Theoretical bounds quantify domain adaptation performance
🔎 Similar Papers
No similar papers found.
Majid Jaberi-Douraki
Majid Jaberi-Douraki
Kansas State University
Mathematical BiologyBig DataData ScienceOne Health1DATA
H
Hossein Sholehrasa
1DATA Consortium and FARAD Program, Kansas State University, Olathe, KS, USA; Department of Computer Science, Kansas State University, Manhattan, KS, USA
X
Xuan Xu
1DATA Consortium and FARAD Program, Kansas State University, Olathe, KS, USA; Department of Statistics, Kansas State University, Olathe, KS, USA
R
Remya Ampadi Ramachandran
1DATA Consortium and FARAD Program, Kansas State University, Olathe, KS, USA; Department of Mathematics, Kansas State University, Olathe, KS, USA