Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Search log data from platforms like Facebook Marketplace suffers from limited diversity and insufficient semantic detail, constraining the semantic matching capability of embedding-based retrieval (EBR) models. Method: The authors propose an LLM-powered, multimodal, multi-task synthetic data augmentation framework optimized for EBR. It uses a three-stage synthesis strategy: (1) LLM generation of diverse queries, (2) enhancement of item-listing content, and (3) reverse inference of queries from the enhanced listings, yielding synthetic data that is semantically relevant, diverse, and low in hallucination. Results: Training an EBR model on 100 million LLM-synthesized samples yields up to a 4% improvement in ROC_AUC, outperforming models trained on the original logs or on mixed-data baselines. This goes beyond conventional data augmentation and supports the effectiveness and robustness of LLM-synthesized data in industrial-scale search applications.
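The three-stage synthesis strategy can be sketched as a small pipeline. Note this is an illustrative sketch only: `call_llm`, the prompts, and the function names are assumptions standing in for whatever Llama endpoint and prompting the paper actually uses.

```python
# Illustrative sketch of the three-stage synthetic data pipeline.
# `call_llm` is a hypothetical stand-in for a real LLM call (e.g., a Llama model);
# it is NOT the paper's implementation.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would query an LLM endpoint here.
    return f"[LLM output for: {prompt[:40]}...]"

def generate_queries(listing: dict) -> list[str]:
    """Stage 1: synthesize diverse queries a shopper might type for this listing."""
    out = call_llm(f"Write 3 distinct search queries for: {listing['title']}")
    return [q.strip() for q in out.splitlines() if q.strip()]

def enhance_listing(listing: dict) -> dict:
    """Stage 2: enrich the listing text with more semantic detail."""
    enriched = call_llm(
        f"Rewrite with richer detail: {listing['title']} - {listing['description']}"
    )
    return {**listing, "description": enriched}

def queries_from_enhanced(listing: dict) -> list[str]:
    """Stage 3: reverse-infer queries from the enhanced listing."""
    return generate_queries(enhance_listing(listing))

# Each (query, listing) pair becomes a synthetic training example for the EBR model.
listing = {"title": "IKEA desk, oak", "description": "Lightly used."}
pairs = [(q, listing) for q in generate_queries(listing) + queries_from_enhanced(listing)]
```

Stage 3 composes stages 1 and 2, which is why the reverse-inferred queries can reference details absent from the original listing text.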

📝 Abstract
Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic matching between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and detail needed for effective EBR model training, limiting the models' ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzes its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: original engagement data (e.g., "Click" and "Listing Interactions"), synthetic data, and a mixture of both, to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or on a mixture of original and synthetic data.
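To make the EBR idea concrete: queries and listings are embedded into a shared vector space and ranked by similarity. The sketch below uses toy bag-of-words vectors and cosine similarity purely for illustration; the paper's actual model is a trained neural encoder, not this.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts. A real EBR model uses a
    # trained neural encoder that outputs dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = "wooden desk"
listings = ["oak wooden desk lightly used", "red bicycle for kids"]
ranked = sorted(listings, key=lambda l: cosine(embed(query), embed(l)), reverse=True)
```

With richer synthetic queries and enhanced listing text, the encoder sees more varied query-listing pairs at training time, which is the mechanism by which the augmentation improves semantic matching.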
Problem

Research questions and friction points this paper is trying to address.

Training EBR models when search log data lacks diversity and semantic detail
Generating high-quality synthetic query-product relevance data with LLMs
Improving Facebook Marketplace search performance via synthetic data augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated synthetic data enhances EBR training
Multimodal, multitask approach optimizes query-product relevance
Models trained solely on synthetic data often outperform those trained on original or mixed data