Improving Search Quality by Focusing on Upstream Data Preparation
The article argues that the quality of semantic search results depends more on input data preparation than on the choice of embedding model. It presents a case study from the author's company, showing a 40-point quality improvement from using LLM-generated structured summaries, rather than raw profile data, as the embedding input.
Why it matters
This article provides valuable insights for teams implementing semantic search or other AI-powered applications, highlighting the importance of data quality and upstream processing over model selection.
Key Points
1. Embedding model benchmarks often don't reflect real-world production conditions
2. Improving the quality of the embedding input can have a much bigger impact than changing the embedding model
3. Decouple embedding input generation from the embedding call to enable model swapping and data-preprocessing improvements
4. Leverage cheap embedding models at query time and expensive LLM-based preprocessing at ingestion
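The decoupling in points 3 and 4 can be sketched as follows. This is a minimal illustration, not the author's implementation: the template-based `generate_embedding_input` stands in for the article's LLM summarization step, and `embed` is a toy stand-in for a real embedding model.

```python
def generate_embedding_input(raw_profile: dict) -> str:
    """Ingestion-time step: turn messy raw data into coherent narrative text.
    The article uses an LLM here; a simple template stands in for it."""
    return (
        f"{raw_profile.get('name', 'Candidate')} is a "
        f"{raw_profile.get('title', 'professional')} with skills in "
        f"{', '.join(raw_profile.get('skills', []))}."
    )

def embed(text: str) -> list[float]:
    """Toy embedding: a normalized bag-of-letters vector.
    Because input generation is a separate step, this model can be
    swapped without touching generate_embedding_input()."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Ingestion: the expensive preprocessing runs once per profile and is amortized.
profile = {"name": "Ada", "title": "data engineer", "skills": ["python", "sql"]}
doc_vector = embed(generate_embedding_input(profile))

# Query time: only the cheap embed() call runs against the user's query.
query_vector = embed("data engineer with python experience")
score = cosine(doc_vector, query_vector)
```

The key property is that `generate_embedding_input` and `embed` have no knowledge of each other, so either the preprocessing or the model can be improved independently.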
Details
The article discusses the author's experience running semantic search for a recruitment marketplace. They found that the choice of embedding model had a relatively small impact (a 7-point spread) compared to the 40-point improvement gained by using LLM-generated structured summaries, rather than raw profile data, as the embedding input. The explanation offered is that embedding models are trained on clean, purposeful text, while real-world production data often lacks narrative coherence.

The article recommends focusing on upstream data preparation before optimizing the embedding model, and details the author's architectural approach: decoupling embedding input generation from the embedding call, amortizing expensive LLM processing at ingestion, and using Postgres with pgvector for vector storage and hybrid filtering.
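A hybrid query against Postgres with pgvector, as mentioned above, might look like the sketch below: structured columns are filtered with ordinary SQL predicates, while semantic ranking uses pgvector's `<=>` cosine-distance operator. The table and column names are hypothetical, not taken from the article.

```python
# Hypothetical hybrid-filtering query for Postgres + pgvector.
# Hard constraints (availability, location) are ordinary WHERE predicates;
# semantic relevance orders the filtered rows by cosine distance.
HYBRID_SEARCH_SQL = """
SELECT id,
       summary,
       embedding <=> %(query_vec)s::vector AS distance
FROM profiles
WHERE available = TRUE              -- structured filter
  AND location = %(location)s      -- structured filter
ORDER BY embedding <=> %(query_vec)s::vector  -- semantic ranking
LIMIT 20;
"""
```

In this pattern the database prunes candidates by exact metadata first, so the vector comparison only ranks rows that already satisfy the hard constraints.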