🤖 AI Summary
Search on Mercari, the largest Japanese C2C marketplace, faces challenges including short, ambiguous user queries; noisy item titles; and stringent online latency constraints. Method: We propose a domain-aware text embedding method tailored for Japanese C2C search. It introduces role-specific prefixes to model semantic asymmetry between queries and titles, integrated with Matryoshka representation learning to yield compact, truncation-robust embeddings. The model is fine-tuned on query-title pairs derived from real purchase behavior, augmented by prompt engineering and log-driven evaluation. Contribution/Results: Offline experiments demonstrate substantial gains over general-purpose encoders. Human evaluation confirms improvements in proper noun recognition, platform-specific semantic understanding, and term importance modeling. Online A/B testing shows statistically significant increases in revenue per user and search efficiency (p < 0.01).
📝 Abstract
Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experience building a domain-aware Japanese text-embedding approach to improve search quality at Mercari, Japan's largest C2C marketplace. We fine-tune on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
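The two core ideas in the abstract, role-specific prefixes and Matryoshka truncation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prefix strings (`"query: "`, `"title: "`) and the embedding dimensions are assumptions for the example, and the encoder itself is omitted.

```python
import numpy as np

# Hypothetical role prefixes; the actual prefix strings used in the
# paper are not specified in the abstract.
QUERY_PREFIX = "query: "
TITLE_PREFIX = "title: "

def add_role_prefix(text: str, role: str) -> str:
    """Prepend a role-specific prefix so a single encoder can model the
    semantic asymmetry between short queries and noisy item titles."""
    prefix = QUERY_PREFIX if role == "query" else TITLE_PREFIX
    return prefix + text

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style compression: a Matryoshka-trained embedding keeps
    its most important information in the leading coordinates, so we can
    simply keep the first `dim` dimensions and re-normalize, and cosine
    similarity stays meaningful (unlike with an arbitrary encoder, where
    a separate PCA step would be needed)."""
    v = embedding[:dim]
    return v / np.linalg.norm(v)

# Usage: prefix the two sides, then compare a full embedding against a
# compact truncation (random vector stands in for the encoder output).
q_text = add_role_prefix("iphone 13 ケース", "query")
rng = np.random.default_rng(0)
full = rng.standard_normal(768)        # assumed full embedding size
full /= np.linalg.norm(full)
compact = truncate_matryoshka(full, 128)  # assumed serving size
print(q_text, compact.shape)
```

In production the truncated dimension trades retrieval quality for index size and latency; Matryoshka training makes that trade-off a single slicing operation at serving time rather than a separate compression model.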