🤖 AI Summary
This work addresses the reliability challenges that multimodal generative models face when they must adhere to structured, domain-specific, or safety-critical knowledge. It introduces, for the first time, a four-layer knowledge injection framework grounded in a structural view of the generation process. The framework decomposes generation into input/output boundaries, transition functions, intermediate states, and model parameters, which correspond respectively to the surface, trajectory, latent, and parametric layers. The authors derive design principles for composing multiple layers and implement knowledge injection for the first three layers using diffusion models and a multimodal knowledge graph. Experimental results show that coordinated injection across these three layers reduces knowledge-violating outputs by 70.97% relative to vanilla generation, supporting both the framework's efficacy and the complementary roles of its layers.
📝 Abstract
Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. These components map to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively: surface (input-side and output-side) and trajectory-latent (mid-generation). We show that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework's complementarity prediction.
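To make the four intervention points concrete, the following is a minimal toy sketch rather than the authors' implementation. It replaces a diffusion model with a one-line update rule and marks where each layer would act in an iterative generation loop. All names here (`ToyModel`, `generate`, `surface_in`, `guidance`, `edit_state`, `surface_out`, `LIMIT`) are hypothetical and chosen only for illustration, and the "knowledge" is assumed to be a simple upper bound on the generated value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

State = List[float]


@dataclass
class ToyModel:
    """Hypothetical stand-in for a generator; `target` plays the role of its parameters."""
    target: float = 1.0
    step_size: float = 0.5

    def transition(self, x: State, cond: float) -> State:
        # One denoising-like update: pull every coordinate toward target + cond.
        goal = self.target + cond
        return [xi + self.step_size * (goal - xi) for xi in x]


def generate(
    model: ToyModel,
    cond: float,
    x0: State,
    steps: int = 10,
    surface_in: Optional[Callable[[float], float]] = None,
    guidance: Optional[Callable[[State, int], State]] = None,
    edit_state: Optional[Callable[[State, int], State]] = None,
    surface_out: Optional[Callable[[State], State]] = None,
) -> State:
    # Surface layer (input side): rewrite the conditioning before generation starts.
    if surface_in is not None:
        cond = surface_in(cond)
    x = list(x0)
    for t in range(steps):
        nxt = model.transition(x, cond)
        # Trajectory layer: add a correction term to the transition itself.
        if guidance is not None:
            nxt = [n + g for n, g in zip(nxt, guidance(x, t))]
        x = nxt
        # Latent layer: directly edit the intermediate state.
        if edit_state is not None:
            x = edit_state(x, t)
    # Surface layer (output side): post-process the finished sample.
    if surface_out is not None:
        x = surface_out(x)
    return x


LIMIT = 0.8  # toy "knowledge" (an assumption): no coordinate may exceed this value


def clamp(x: State) -> State:
    # Project a state back into the allowed region.
    return [min(xi, LIMIT) for xi in x]


vanilla = generate(ToyModel(), 0.0, [0.0])
surface = generate(ToyModel(), 0.0, [0.0], surface_in=lambda c: c - 0.2)
trajectory = generate(
    ToyModel(), 0.0, [0.0],
    guidance=lambda x, t: [-0.5 * max(0.0, xi - LIMIT) for xi in x],
)
latent = generate(ToyModel(), 0.0, [0.0], edit_state=lambda x, t: clamp(x))
# Parametric layer: change the model's parameters instead of the loop.
parametric = generate(ToyModel(target=LIMIT), 0.0, [0.0])
```

In this toy setup the guidance term lowers the final value without bringing it under the limit on its own, while the latent clamp enforces the limit directly. That is only a loose analogy for the complementarity the abstract reports and should not be read as a reproduction of the experiment.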