🤖 AI Summary
To address the fine-grained product recognition challenge in retail advertising flyers—characterized by high visual similarity among items and rapid category turnover—this paper introduces a Visual Retrieval-Augmented Generation (Visual RAG) framework tailored for zero-shot and few-shot settings. The method integrates multi-source image parsing, lightweight vision-language models (e.g., GPT-4o-mini, Gemini 2.0 Flash), and retrieval-augmented inference, eliminating the need for model retraining; instead, it dynamically updates a RAG knowledge base to jointly predict product IDs, prices, and discount information for newly introduced items. Evaluated on a diverse retail dataset, it achieves 86.8% fine-grained classification accuracy, substantially outperforming conventional fine-tuning and prompt-engineering approaches. Key contributions include: (1) introducing the Visual RAG paradigm, which decouples recognition capability from parameter updates; (2) enabling real-time product expansion and structured multi-attribute output; and (3) empirically validating the efficacy and deployability of lightweight VLMs combined with retrieval in resource-constrained retail environments.
📝 Abstract
Despite the rapid evolution of machine learning and computer vision algorithms, Fine-Grained Classification (FGC) still poses an open problem in many practically relevant applications. In the retail domain, for example, the identification of fast-changing and visually highly similar products and their properties is key to automated price monitoring and product recommendation. This paper presents a novel Visual RAG pipeline that combines the Retrieval-Augmented Generation (RAG) approach and Vision Language Models (VLMs) for few-shot FGC. The Visual RAG pipeline extracts product and promotion data from advertisement leaflets of various retailers and simultaneously predicts fine-grained product IDs along with price and discount information. Compared to previous approaches, the key characteristic of the Visual RAG pipeline is that it allows the prediction of novel products without re-training, simply by adding a few class samples to the RAG database. Comparing several VLM back-ends such as GPT-4o [23], GPT-4o-mini [24], and Gemini 2.0 Flash [10], our approach achieves 86.8% accuracy on a diverse dataset.
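The abstract's central claim, that novel products can be recognized simply by adding a few class samples to the RAG database rather than re-training, can be illustrated with a toy sketch. The paper does not publish its implementation details, so everything below (class names, embedding dimensionality, the cosine-similarity retrieval step) is a hypothetical stand-in: a real system would embed leaflet crops with a vision encoder and pass the retrieved candidates to a VLM back-end for the final structured prediction.

```python
import numpy as np

class VisualRAGIndex:
    """Toy retrieval database standing in for the paper's RAG knowledge base.
    Embeddings are plain vectors here; in practice they would come from a
    vision encoder applied to product images."""

    def __init__(self):
        self.embeddings = []  # one embedding per stored class sample
        self.metadata = []    # product ID / price / discount per sample

    def add_sample(self, embedding, product_id, price, discount=None):
        # Enrolling a new class means appending a few samples:
        # no model parameters are updated anywhere.
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.metadata.append({"product_id": product_id,
                              "price": price,
                              "discount": discount})

    def retrieve(self, query_embedding, k=3):
        # Cosine-similarity nearest-neighbour retrieval over stored samples.
        q = np.asarray(query_embedding, dtype=float)
        sims = [float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
                for e in self.embeddings]
        order = np.argsort(sims)[::-1][:k]
        return [self.metadata[i] for i in order]

# Few-shot enrolment of two visually similar products (hypothetical IDs).
index = VisualRAGIndex()
index.add_sample([1.0, 0.1, 0.0], "cola_330ml", price=0.79)
index.add_sample([0.9, 0.2, 0.1], "cola_330ml", price=0.79)
index.add_sample([0.1, 1.0, 0.0], "cola_zero_330ml", price=0.85)

# A novel product appears in a new leaflet: just add samples, no retraining.
index.add_sample([0.0, 0.1, 1.0], "cola_cherry_330ml",
                 price=0.89, discount="2-for-1")

hits = index.retrieve([0.05, 0.05, 0.95], k=1)
print(hits[0]["product_id"])  # prints "cola_cherry_330ml"
```

In the actual pipeline the retrieved candidates would be inserted into the VLM prompt, which then emits the fine-grained product ID together with the price and discount fields as structured output.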