Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
By: Andre Rusli , Miao Cao , Shoma Ishimoto and more
Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiment to build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. We experimented with fine-tuning on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
Similar Papers
Adaptation of Embedding Models to Financial Filings via LLM Distillation
Computation and Language
Teaches AI to find specific money information faster.
A Data-Centric Approach to Multilingual E-Commerce Product Search: Case Study on Query-Category and Query-Item Relevance
Information Retrieval
Helps online stores understand shoppers in any language.
Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data
Machine Learning (CS)
Makes computers remember information faster and better.