VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Attribute Value Extraction (AVE) datasets are limited to text-to-text or image-to-text paradigms, lacking support for e-commerce videos, attribute diversity, and public accessibility. Method: We introduce VideoAVE, the first publicly available video-to-text AVE dataset, covering 14 product categories and 172 open-ended attributes, with 224K training and 25K evaluation samples. To ensure data quality, a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) removes mismatched video-product pairs. Contribution/Results: VideoAVE establishes the first comprehensive benchmark for video-to-text AVE, evaluating state-of-the-art video vision language models on both attribute-conditioned value prediction and open attribute-value pair extraction. Experiments reveal substantial performance gaps for current models on video inputs, underscoring the dataset's role in advancing research on video-based AVE.

📝 Abstract
Attribute Value Extraction (AVE) is important for structuring product information in e-commerce. However, existing AVE datasets are primarily limited to text-to-text or image-to-text settings, lacking support for product videos, diverse attribute coverage, and public availability. To address these gaps, we introduce VideoAVE, the first publicly available video-to-text e-commerce AVE dataset, spanning 14 different domains and covering 172 unique attributes. To ensure data quality, we propose a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) to remove mismatched video-product pairs, resulting in a refined dataset of 224k training samples and 25k evaluation samples. To evaluate the usability of the dataset, we further establish a comprehensive benchmark by evaluating several state-of-the-art video vision language models (VLMs) under both attribute-conditioned value prediction and open attribute-value pair extraction tasks. Our analysis of the results reveals that video-to-text AVE remains a challenging problem, particularly in open settings, and that there is still room for developing more advanced VLMs capable of effectively leveraging temporal information. The dataset and benchmark code for VideoAVE are available at: https://github.com/gjiaying/VideoAVE
Problem

Research questions and friction points this paper is trying to address.

Extracting attribute values from e-commerce videos
Addressing lack of video-to-text AVE datasets
Evaluating video VLMs for attribute extraction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP-based Mixture of Experts filtering
Multi-domain video-to-text dataset creation
Benchmarking video vision language models
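The filtering idea behind CLIP-MoE can be illustrated with a single-expert reduction: score each video against its product text with CLIP-style embedding similarity and drop pairs below a threshold. This is a minimal sketch only; it assumes precomputed frame and title embeddings, uses a placeholder threshold, and omits the paper's mixture-of-experts routing across category-specific CLIP experts.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def keep_pair(frame_embs: list[np.ndarray],
              title_emb: np.ndarray,
              threshold: float = 0.25) -> bool:
    """Keep a video-product pair if the mean frame-to-title similarity
    clears the threshold. The threshold value is a placeholder, not the
    one used by CLIP-MoE."""
    sims = [cosine_similarity(f, title_emb) for f in frame_embs]
    return sum(sims) / len(sims) >= threshold
```

In the paper's full system, a mixture of CLIP experts replaces the single scorer, which lets the filter adapt its notion of "matching" to each of the 14 product domains.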
Ming Cheng
Dartmouth College
Tong Wu
Virginia Tech, Blacksburg, Virginia, USA
Jiazhen Hu
Virginia Tech, Blacksburg, Virginia, USA
Jiaying Gong
Virginia Tech, Blacksburg, Virginia, USA
Hoda Eldardiry
Associate Professor of Computer Science, Virginia Tech