An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation

📅 2024-08-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal dialogue models for photo sharing typically adopt a pipeline architecture—image → text description → response/generation—which suffers from visual detail loss, error propagation, and disrupted end-to-end gradient flow due to discrete textual intermediaries. This work proposes the first end-to-end trainable vision-language dialogue generation model. It enhances visual perception via Q-Former, introduces a dynamic vocabulary transformation matrix, and bridges cross-modal gradients between LLMs and Stable Diffusion using straight-through estimation and Gumbel-Softmax. The framework enables natural interleaved text responses and contextually relevant image generation within multi-turn dialogues. On PhotoChat and DialogCC, it achieves state-of-the-art performance on both textual and image generation metrics. Ablation studies confirm that the end-to-end architecture delivers critical gains in modeling temporal alignment and co-evolution of text and images across dialogue turns.

📝 Abstract
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image captions as the bridge, a pipeline model integrates an image captioning model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing images with text captions may lose important visual details and information and causes error propagation through the complex dialogue system. Moreover, the pipeline model keeps the three models isolated, because discrete image captions block end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images at the input end. For image generation at the output end, we propose a dynamic vocabulary transformation matrix and use straight-through and Gumbel-Softmax techniques to align the large language model with the Stable Diffusion model and achieve end-to-end gradient propagation. We perform experiments on the PhotoChat and DialogCC datasets to evaluate our end-to-end model. Compared with pipeline models, the end-to-end model achieves state-of-the-art performance on various text and image generation metrics. Further analysis experiments also verify the effectiveness of the end-to-end model for photo-sharing multi-modal dialogue generation.
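The gradient bridge described above rests on two standard tricks: the Gumbel-Softmax relaxation (differentiably sampling a near-one-hot token distribution) and the straight-through estimator (using the hard one-hot vector in the forward pass while letting gradients flow through the soft distribution in the backward pass). A minimal NumPy sketch of the sampling step, with the backward-pass behavior noted in comments; all names here are illustrative, not the paper's code:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, hard=True, rng=None):
    """Draw a (near-)one-hot sample from a categorical distribution
    parameterized by `logits`, via the Gumbel-Softmax relaxation."""
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    # Softmax with temperature tau: lower tau -> closer to one-hot
    y = np.exp((logits + gumbel - (logits + gumbel).max()) / tau)
    soft = y / y.sum()
    if not hard:
        return soft
    # Straight-through trick: the forward pass uses the hard one-hot
    # vector, while in an autograd framework one would return
    #   hard_onehot - stop_gradient(soft) + soft
    # so gradients are taken with respect to the soft distribution.
    hard_onehot = np.zeros_like(soft)
    hard_onehot[np.argmax(soft)] = 1.0
    return hard_onehot

logits = np.array([2.0, 0.5, -1.0])
sample = gumbel_softmax_sample(logits, tau=0.5)  # a one-hot vector
```

Lower temperatures make the soft sample sharper; in deep-learning frameworks this pattern is typically available built-in (e.g. a hard Gumbel-Softmax with a straight-through gradient).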
Problem

Research questions and friction points this paper is trying to address.

Generating text responses and sharing photos at the proper moment in multi-modal dialogues
Visual detail loss and error propagation in pipeline models
Lack of end-to-end gradient propagation in pipeline dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end model integrating image perception and image/text generation with an LLM
Q-Former for visual perception at the LLM's input end
Dynamic vocabulary transformation matrix aligning the LLM with Stable Diffusion
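The dynamic vocabulary transformation matrix is described only at a high level here; one plausible reading is a learned matrix that re-expresses a (soft) distribution over the LLM's vocabulary as a distribution over the diffusion text encoder's vocabulary, so that the expected text-encoder embedding remains differentiable end to end. A toy sketch under that assumption; the sizes, names, and random stand-in weights are illustrative, not the paper's:

```python
import numpy as np

# Toy sizes; real vocabularies are on the order of 32k (LLM) and 49k (CLIP).
llm_vocab, sd_vocab, sd_dim = 10, 12, 4
rng = np.random.default_rng(0)

# Illustrative stand-ins for learned parameters (not the paper's weights):
vocab_transform = rng.normal(size=(llm_vocab, sd_vocab))  # "dynamic vocabulary" matrix
sd_embeddings = rng.normal(size=(sd_vocab, sd_dim))       # diffusion text-encoder table

def soft_token_to_sd_embedding(llm_probs):
    """Map a soft distribution over LLM tokens to a diffusion text-encoder
    embedding, keeping the whole path differentiable."""
    sd_scores = llm_probs @ vocab_transform           # re-express in the SD vocabulary
    sd_scores = np.maximum(sd_scores, 0.0)            # keep scores non-negative
    sd_probs = sd_scores / (sd_scores.sum() + 1e-20)  # renormalize to a distribution
    return sd_probs @ sd_embeddings                   # expected embedding

probs = np.zeros(llm_vocab)
probs[3] = 1.0                          # a one-hot LLM token, e.g. from Gumbel-Softmax
emb = soft_token_to_sd_embedding(probs)  # shape (sd_dim,)
```

Because every operation is a matrix product or elementwise map, gradients from the image generator could flow back through the transformation into the LLM, which is the property the end-to-end design needs.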