🤖 AI Summary
Large vision-language models (VLMs) suffer from catastrophic forgetting when incorporating domain-specific knowledge, degrading their general vision–language alignment capability. To address this, we propose Structured Dialogue Fine-Tuning (SDFT), a curriculum-based three-stage dialogue mechanism: (1) *Foundation Preservation*, maintaining foundational multimodal understanding; (2) *Contrastive Disambiguation*, explicitly distinguishing generic from domain-specific semantics; and (3) *Knowledge Specialization*, deepening domain-aware reasoning. SDFT introduces data-driven, stage-specific dialogue templates and a weighted multi-turn supervision framework that integrates counterfactual contrastive learning with chain-of-thought reasoning. Evaluated across multiple specialized domains, SDFT significantly enhances domain knowledge comprehension while preserving over 98.2% of the original vision–language alignment performance, effectively mitigating catastrophic forgetting.
📝 Abstract
Large Vision-Language Models have demonstrated impressive, versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models face a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
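To make the weighted multi-turn supervision idea concrete, here is a minimal sketch of how per-turn losses from the three dialogue phases might be combined. The stage names follow the abstract, but the weight values, function name, and normalization scheme are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of weighted multi-turn supervision: each dialogue
# turn belongs to one SDFT phase, and its loss is scaled by a
# phase-specific weight before aggregation. Weights below are assumed,
# not taken from the paper.
STAGE_WEIGHTS = {
    "foundation_preservation": 0.5,    # assumed: down-weight caption turns
    "contrastive_disambiguation": 1.0,
    "knowledge_specialization": 1.5,   # assumed: emphasize domain turns
}

def weighted_dialogue_loss(turn_losses):
    """Combine per-turn losses into a single training objective.

    turn_losses: list of (stage_name, loss_value) pairs, one per turn.
    Returns the weight-normalized sum, so dialogues with different
    mixes of turn types remain on a comparable loss scale.
    """
    total = sum(STAGE_WEIGHTS[stage] * loss for stage, loss in turn_losses)
    norm = sum(STAGE_WEIGHTS[stage] for stage, _ in turn_losses)
    return total / norm

loss = weighted_dialogue_loss([
    ("foundation_preservation", 2.0),
    ("contrastive_disambiguation", 1.2),
    ("knowledge_specialization", 0.8),
])
```

In a real training loop the per-turn values would be token-level cross-entropy losses computed over each assistant turn; the sketch only shows how the phase weights shape their aggregation.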