🤖 AI Summary
Users often struggle to accurately describe product defects in text but readily upload complaint videos. This paper introduces CoD-V, a novel task of generating structured, emotionally expressive textual complaint descriptions from user videos. To support the task, the authors construct ComVID, a benchmark dataset of 1,175 user-generated complaint videos annotated with the complainer's emotional state, and propose Complaint Retention (CR), a new evaluation metric that distinguishes CoD-V from standard video summarization and description tasks. Methodologically, they extend VideoLLaMA2-7B with multimodal Retrieval-Augmented Generation (RAG) to jointly model video semantics and user affective state. Experiments across pre-trained and fine-tuned Video Language Models are reported on METEOR, perplexity, and readability metrics. This work establishes a benchmark dataset, open-source resources, and an evaluation framework for multimodal complaint understanding, advancing research on helping users articulate their intent precisely.
📝 Abstract
Despite extensive work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as "worst product" paired with a 5-second video of a headphone with a broken right earcup). This paper formulates a new task in complaint mining, Complaint Description from Videos (CoD-V), to help everyday users write expressive complaints (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos with corresponding descriptions, each annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that distinguishes the proposed CoD-V task from standard video summarization and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG)-embedded VideoLLaMA2-7B model designed to generate complaints while accounting for the user's emotional state. We conduct a comprehensive evaluation of several pre-trained and fine-tuned Video Language Models across a range of established metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction that provides a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.