Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

📅 2025-04-27
🤖 AI Summary
Large vision-language models (VLMs) suffer from catastrophic forgetting when incorporating domain-specific knowledge, degrading their general vision-language alignment. To address this, the paper proposes Structured Dialogue Fine-Tuning (SDFT), a curriculum-based three-phase dialogue mechanism: (1) *Foundation Preservation*, which maintains pre-trained multimodal alignment through captioning tasks; (2) *Contrastive Disambiguation*, which uses counterfactual examples to keep generic and domain-specific semantics distinct; and (3) *Knowledge Specialization*, which embeds domain knowledge through chain-of-thought reasoning. SDFT introduces data-centric, stage-specific dialogue templates and a weighted multi-turn supervision framework that integrates counterfactual contrastive learning with chain-of-thought reasoning. Evaluated across multiple specialized domains, SDFT substantially improves domain knowledge comprehension while preserving over 98.2% of the original vision-language alignment performance, effectively mitigating catastrophic forgetting.

📝 Abstract
Large Vision-Language Models have demonstrated impressive, versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models face a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
Problem

Research questions and friction points this paper is trying to address.

Incorporating specialized knowledge without forgetting foundational abilities
Balancing domain-specific knowledge injection with general capability retention
Preventing catastrophic forgetting in large vision-language models during fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Dialogue Fine-Tuning for knowledge injection
Three-phase dialogue structure prevents catastrophic forgetting
Weighted multi-turn supervision balances knowledge integration
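The weighted multi-turn supervision mentioned above can be sketched as a weighted combination of per-turn losses. This is a minimal illustration only; the function name, turn weights, and loss values below are assumptions for exposition, not details taken from the paper.

```python
# Hypothetical sketch of SDFT-style weighted multi-turn supervision.
# Each dialogue turn (foundation caption, contrastive disambiguation,
# knowledge specialization) yields its own supervised loss; turns are
# then combined with per-turn weights. Weights here are illustrative.

def weighted_multiturn_loss(turn_losses, turn_weights):
    """Combine per-turn supervised losses into one training objective.

    turn_losses  : one scalar loss per dialogue turn.
    turn_weights : relative importance of each turn; the specialization
                   turn can be up-weighted to emphasize domain knowledge.
    """
    assert len(turn_losses) == len(turn_weights)
    total_weight = sum(turn_weights)
    return sum(w * l for w, l in zip(turn_weights, turn_losses)) / total_weight

# Example: three turns, up-weighting the third (specialization) turn.
loss = weighted_multiturn_loss([0.9, 1.2, 2.0], [1.0, 1.0, 2.0])
```

With equal weights the objective reduces to the plain average of turn losses; up-weighting later turns shifts the gradient budget toward domain-specific supervision while still training on the alignment-preserving turns.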
Yijie Hong
Shanghai Jiao Tong University
Xiaofei Yin
Ant Security Lab, Ant Group
Xinzhong Wang
Shanghai Jiao Tong University
Yi Tu
Ant Group
Ya Guo
Ant Security Lab, Ant Group
Sufeng Duan
Shanghai Jiao Tong University
Weiqiang Wang
Ant Security Lab, Ant Group
Lingyong Fang
Shanghai Jiao Tong University
Depeng Wang
Ant Security Lab, Ant Group
Huijia Zhu
Ant Security Lab, Ant Group
Computer Vision · Document Understanding · Vision Language Model