🤖 AI Summary
Users often struggle to accurately describe product defects in text but readily upload complaint videos. This paper introduces CoD-V, a novel task of generating structured, emotionally expressive textual complaint descriptions from user videos. To support the task, the authors construct ComVID, a benchmark dataset of 1,175 user-generated complaint videos annotated with the complainer's emotional state, and propose Complaint Retention (CR), a new evaluation metric that distinguishes CoD-V from standard video summarization and description tasks. Methodologically, they extend VideoLLaMA2-7B with multimodal Retrieval-Augmented Generation (RAG) to jointly model video semantics and user affective state. Experiments across pre-trained and fine-tuned Video Language Models are reported on METEOR, perplexity, and readability metrics. This work establishes a benchmark dataset, open-source resources, and an evaluation framework for multimodal complaint understanding, advancing research on helping users articulate their intent precisely.
📝 Abstract
Despite extensive work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as "worst product" paired with a 5-second video of a headphone with a broken right earcup). This paper formulates a new task in complaint mining, Complaint Description from Videos (CoD-V), to help everyday users write expressive complaints (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos with corresponding descriptions, each annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that distinguishes the proposed CoD-V task from standard video summarization and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG)-embedded VideoLLaMA2-7B model designed to generate complaints while accounting for the user's emotional state. We conduct a comprehensive evaluation of several pre-trained and fine-tuned Video Language Models across a range of established metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction that provides a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.