3DS: Decomposed Difficulty Data Selection's Case Study on LLM Medical Domain Adaptation

📅 2024-10-13
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
To address the performance degradation of large language models (LLMs) in healthcare due to misalignment between supervised fine-tuning (SFT) data and the model’s intrinsic knowledge distribution, this paper proposes 3DS, a model-centric two-stage data selection framework. Methodologically, 3DS introduces: (1) an explicit alignment-based filtering mechanism grounded in the model’s internal knowledge; (2) a decoupled difficulty assessment model evaluating instruction understanding, response confidence, and response correctness along three orthogonal dimensions; and (3) attention-driven token-level importance weighting. Evaluated on real-world medical datasets, 3DS achieves over 5.29% absolute accuracy improvement over strong baselines—including GPT-4–generated annotations and human-curated selections—demonstrating its effectiveness in constructing high-quality, domain-specific SFT data. This work establishes a novel paradigm for efficient, model-aware data curation in specialized domains.
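The Stage-1 alignment-based filtering described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: all names and thresholds are assumptions, and the alignment score is taken as a pre-computed model-derived value in [0, 1], where values near 1 mean the model already masters the sample (redundant) and values near 0 mean it lies outside the model's knowledge (irrelevant).

```python
def explicit_alignment_filter(candidates, score_fn, low=0.2, high=0.8):
    """Keep samples whose alignment score lies in [low, high]: neither
    redundant (already known, score near 1) nor irrelevant (score near 0).
    score_fn is any model-derived alignment score in [0, 1]; the band
    endpoints here are illustrative, not from the paper."""
    return [c for c in candidates if low <= score_fn(c) <= high]

# Toy usage: the alignment score is stored alongside each sample.
samples = [{"q": "easy",      "align": 0.95},
           {"q": "useful",    "align": 0.55},
           {"q": "off-topic", "align": 0.05}]
kept = explicit_alignment_filter(samples, lambda s: s["align"])
```

With this band, only the mid-scoring "useful" sample survives; the redundant and irrelevant ones are dropped before the difficulty stage runs.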

📝 Abstract
Large Language Models (LLMs) excel in general tasks but struggle in specialized domains like healthcare due to limited domain-specific knowledge. Supervised Fine-Tuning (SFT) data construction for domain adaptation often relies on heuristic methods, such as GPT-4 annotation or manual data selection, with a data-centric focus on presumed diverse, high-quality datasets. However, these methods overlook the model's inherent knowledge distribution, introducing noise, redundancy, and irrelevant data; the resulting mismatch between the selected data and the model's learning task leads to suboptimal performance. To address this, we propose a two-stage model-centric data selection framework, Decomposed Difficulty Data Selection (3DS), which aligns data with the model's knowledge distribution for optimized adaptation. In Stage 1, we apply Prompt-Driven Data Selection via Explicit Alignment, where the model filters irrelevant or redundant data based on its internal knowledge. In Stage 2, we perform Decomposed Difficulty Data Selection, where data selection is guided by our defined difficulty decomposition, using three metrics: Instruction Understanding, Response Confidence, and Response Correctness. Additionally, an attention-based importance weighting mechanism captures token importance for more accurate difficulty calibration. This two-stage approach ensures the selected data is not only aligned with the model's knowledge and preferences but also appropriately challenging for the model to learn, leading to more effective and targeted domain adaptation. In a case study of the medical domain, our extensive experiments on real-world healthcare datasets demonstrate the superiority of 3DS over existing methods, with accuracy gains of over 5.29%. Our dataset and code will be open-sourced at https://anonymous.4open.science/r/3DS-E67F.
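As a rough illustration of the Stage-2 decomposition, the three metrics could be proxied from per-token log-probabilities: instruction perplexity for Instruction Understanding, response perplexity for Response Confidence, and an answer-match check for Response Correctness. This is a minimal sketch under those assumed proxies; the paper's exact formulas may differ, and every name below is illustrative.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def decomposed_difficulty(instr_logprobs, resp_logprobs,
                          model_answer, gold_answer):
    """Three decoupled difficulty signals for one SFT sample (proxies):
    - instruction_understanding: ppl of the instruction tokens
      (how well the model parses the question)
    - response_confidence: ppl of the reference response given the
      instruction (how confidently the model could produce it)
    - response_correctness: does the model's own answer match the gold?
    """
    return {
        "instruction_understanding": perplexity(instr_logprobs),
        "response_confidence": perplexity(resp_logprobs),
        "response_correctness": float(model_answer.strip().lower()
                                      == gold_answer.strip().lower()),
    }

# Toy usage with hand-picked log-probabilities for a multiple-choice item.
scores = decomposed_difficulty([-0.1, -0.3], [-1.2, -0.8], "B", "b")
```

Each dimension is kept separate rather than collapsed into one scalar, which is the point of the decomposition: a sample can be easy to parse yet hard to answer confidently, and selection can treat those cases differently.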
Problem

Research questions and friction points this paper is trying to address.

Optimizing medical domain adaptation of LLMs via data selection
Addressing noise and redundancy in supervised fine-tuning data
Aligning data with model's knowledge distribution for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage model-centric data selection framework
Prompt-driven filtering using internal knowledge
Decomposed difficulty metrics with attention weighting
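The attention-driven weighting in the last bullet can be sketched as a weighted negative log-likelihood: attention-derived importance scores (taken here as given inputs; how they are pooled from attention maps is an assumption, not specified by this page) are normalized so that salient tokens dominate the difficulty score.

```python
def attention_weighted_nll(token_logprobs, attn_weights):
    """Difficulty of a response as attention-weighted negative
    log-likelihood. attn_weights are non-negative importance scores
    (e.g. attention mass pooled per token); they are normalized to
    sum to 1 before aggregation."""
    total = sum(attn_weights)
    if total == 0:
        raise ValueError("attention weights must not all be zero")
    return -sum((w / total) * lp
                for w, lp in zip(token_logprobs, attn_weights))

# With uniform weights this reduces to the mean NLL of the tokens.
score = attention_weighted_nll([-1.0, -3.0], [1.0, 1.0])
```

Skewing the weights toward the harder token would raise the score above this uniform mean, which is the intended calibration effect: difficulty reflects the tokens the model actually attends to, not padding or boilerplate.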
Hongxin Ding
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Yue Fang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Runchuan Zhu
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Xinke Jiang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Jinyang Zhang
College of Computer Science, Zhejiang University, Hangzhou, China
Yongxin Xu
Peking University
Large Language Models · Knowledge Graphs · Electronic Medical Record Analysis
Xu Chu
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China
Junfeng Zhao
Assistant Professor at Arizona State University, Director of BELIV Lab
Connected & Automated Vehicle · Motion Planning & Controls · Electric Vehicles · AI/ML
Yasha Wang
Key Laboratory of High Confidence Software Technologies, Ministry of Education; School of Computer Science, Peking University, Beijing, China