PRISM: Product Retrieval In Shopping Carts using Hybrid Matching

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
In retail scenarios, high-similarity products—such as same-category items from different brands—suffer from low retrieval accuracy and inefficiency due to viewpoint variations and ambiguous local details. To address this, we propose a three-stage hybrid retrieval framework: (1) semantic coarse filtering using SigLIP; (2) foreground-aware object segmentation via YOLO-E to suppress background interference; and (3) lightweight pixel-level local feature matching within the candidate set using LightGlue. This design overcomes the limitation of global vision-language models in modeling fine-grained discriminative cues. Evaluated on the ABV dataset, our method achieves a 4.21% absolute improvement in top-1 accuracy over the state-of-the-art, while maintaining real-time inference capability. The framework thus delivers both high precision and practical deployability for large-scale retail visual search applications.

📝 Abstract
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from the view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. In this paper, we propose a new hybrid method, called PRISM, for product retrieval in retail settings that leverages the advantages of both vision-language model-based and pixel-wise matching approaches. To provide both efficiency/speed and fine-grained retrieval accuracy, PRISM consists of three stages: 1) a vision-language model (SigLIP) is employed first to retrieve the top 35 most semantically similar products from a fixed gallery, thereby narrowing the search space significantly; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed using LightGlue across the filtered candidates. This framework enables more accurate discrimination between products with high inter-class similarity by focusing on subtle visual cues often missed by global models. Experiments performed on the ABV dataset show that our proposed PRISM outperforms the state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while still remaining within the bounds of real-time processing for practical retail deployments.
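The coarse-to-fine structure described in the abstract can be sketched as a two-step retrieve-then-rerank loop. The snippet below is a minimal structural illustration only: it uses plain NumPy cosine similarity in place of SigLIP embeddings and a precomputed per-candidate score in place of LightGlue keypoint-match counts (the YOLO-E segmentation stage, which would crop the foreground before matching, is omitted). All function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def coarse_filter(query_emb, gallery_embs, k=35):
    """Stage 1 (stand-in for SigLIP): rank the fixed gallery by cosine
    similarity to the query embedding and keep the top-k candidates."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    return np.argsort(-sims)[:k]      # indices of the k best matches

def fine_rerank(candidate_ids, match_scores):
    """Stage 3 (stand-in for LightGlue): pick the candidate with the
    highest pixel-level match score, e.g. a keypoint-match count."""
    return max(candidate_ids, key=lambda i: match_scores[i])
```

Confining the expensive pixel-level stage to the 35-item shortlist is what keeps the pipeline within real-time bounds: the fine matcher runs a constant number of times regardless of gallery size.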
Problem

Research questions and friction points this paper is trying to address.

Distinguishing visually similar retail products with subtle differences
Overcoming computational inefficiency in pixel-wise matching methods
Improving retrieval accuracy while maintaining real-time processing speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid vision-language and pixel matching
Three-stage filtering with segmentation
Real-time fine-grained product retrieval