🤖 AI Summary
Search on Mercari, the largest Japanese C2C marketplace, faces challenges including short, ambiguous user queries; noisy item titles; and stringent online latency constraints. Method: We propose a domain-aware text embedding method tailored for Japanese C2C search. It introduces role-specific prefixes to model semantic asymmetry between queries and titles, integrated with Matryoshka representation learning to yield compact, truncation-robust embeddings. The model is fine-tuned on query-title pairs derived from real purchase behavior, augmented by prompt engineering and log-driven evaluation. Contribution/Results: Offline experiments demonstrate substantial gains over general-purpose encoders. Human evaluation confirms improvements in proper noun recognition, platform-specific semantic understanding, and term importance modeling. Online A/B testing shows statistically significant increases in revenue per user and search efficiency (p < 0.01).
📝 Abstract
Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experience building a domain-aware Japanese text-embedding approach to improve search quality at Mercari, Japan's largest C2C marketplace. We fine-tune on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
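The two core ideas in the abstract, role-specific prefixes and Matryoshka truncation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prefix strings (`"query: "`, `"title: "`) and the embedding dimensions are assumptions for the example, and the encoder itself is omitted.

```python
import numpy as np

# Hypothetical role prefixes; the actual prefix strings used in the
# paper are not specified in the abstract.
QUERY_PREFIX = "query: "
TITLE_PREFIX = "title: "

def add_role_prefix(text: str, role: str) -> str:
    """Prepend a role-specific prefix so a single encoder can model the
    semantic asymmetry between short queries and noisy item titles."""
    prefix = QUERY_PREFIX if role == "query" else TITLE_PREFIX
    return prefix + text

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style compression: a Matryoshka-trained embedding keeps
    its most important information in the leading coordinates, so we can
    simply keep the first `dim` dimensions and re-normalize, and cosine
    similarity stays meaningful (unlike with an arbitrary encoder, where
    a separate PCA step would be needed)."""
    v = embedding[:dim]
    return v / np.linalg.norm(v)

# Usage: prefix the two sides, then compare a full embedding against a
# compact truncation (random vector stands in for the encoder output).
q_text = add_role_prefix("iphone 13 ケース", "query")
rng = np.random.default_rng(0)
full = rng.standard_normal(768)        # assumed full embedding size
full /= np.linalg.norm(full)
compact = truncate_matryoshka(full, 128)  # assumed serving size
print(q_text, compact.shape)
```

In production the truncated dimension trades retrieval quality for index size and latency; Matryoshka training makes that trade-off a single slicing operation at serving time rather than a separate compression model.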