LLM-Powered Relevance Assessment for Pinterest Search
Pinterest uses fine-tuned large language models (LLMs) to scale relevance labeling and improve search quality evaluation in online A/B experiments.
Why it matters
This work demonstrates how LLMs can be leveraged to scale relevance labeling and improve search quality measurement, which is crucial for personalized search systems.
Key Points
- Pinterest measures search relevance against a 5-level guideline and fine-tunes open-source LLMs to predict relevance scores
- LLM labeling significantly reduces labeling costs and enables a stratified sampling design to measure heterogeneous treatment effects
- The stratified sampling approach and LLM-powered labeling reduced the minimum detectable effect (MDE) by an order of magnitude
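The first point can be sketched in code. This is a hypothetical illustration: the label names and the aggregation (mean score over the top-k results) are assumptions for the example, not Pinterest's actual 5-level guideline or metric definition.

```python
# Hypothetical 5-level relevance guideline mapped to numeric scores.
# Label names are illustrative, not Pinterest's actual taxonomy.
LABEL_TO_SCORE = {
    "highly_relevant": 5,
    "relevant": 4,
    "somewhat_relevant": 3,
    "marginally_relevant": 2,
    "irrelevant": 1,
}

def whole_page_relevance(labels, k=10):
    """Average relevance score over the top-k results on a page.

    `labels` is the list of per-result relevance labels (e.g. predicted
    by a fine-tuned LLM), ordered by rank position.
    """
    top = labels[:k]
    if not top:
        return 0.0
    return sum(LABEL_TO_SCORE[label] for label in top) / len(top)

page = ["highly_relevant", "relevant", "irrelevant"]
print(whole_page_relevance(page))  # (5 + 4 + 1) / 3 ≈ 3.33
```

In an A/B experiment, this page-level score would be computed for both the control and treatment arms and compared.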
Details
Pinterest tracks whole-page relevance in online A/B experiments to evaluate new ranking models. Relevance measurement has typically relied on human annotation, which is constrained by low annotator availability and high cost. To address this, Pinterest fine-tunes open-source LLMs on relevance prediction tasks using human-annotated labels, then uses the fine-tuned models to evaluate ranking results across experimental groups, significantly reducing labeling costs and improving evaluation efficiency. The scalable LLM labeling also enables a stratified query sampling design, which reduces the minimum detectable effect (MDE) by an order of magnitude compared to the previous approach.
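The MDE claim can be illustrated with a toy calculation. The sketch below uses the standard two-sided difference-in-means MDE formula; the strata weights, variances, and sample size are made-up numbers for illustration, not Pinterest's data. Stratified sampling helps because, with proportional allocation, only the within-stratum variance contributes to the standard error, while the between-stratum variance drops out.

```python
from math import sqrt
from statistics import NormalDist

def mde(std_err, alpha=0.05, power=0.8):
    """Two-sided MDE for a difference-in-means test at the given
    significance level and statistical power."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * std_err

n = 10_000  # queries sampled per experiment arm (illustrative)

# Simple random sampling: one pooled variance across all queries.
pooled_var = 1.0
srs_se = sqrt(2 * pooled_var / n)  # factor of 2: two experiment arms

# Stratified sampling with proportional allocation: only the
# within-stratum variance remains. (weight, within-stratum variance)
strata = [(0.5, 0.2), (0.3, 0.4), (0.2, 0.3)]
within_var = sum(w * v for w, v in strata)  # 0.28 < 1.0
strat_se = sqrt(2 * within_var / n)

print(mde(srs_se) > mde(strat_se))  # True: stratification tightens the MDE
```

Cheap LLM labeling is what makes this design practical: the stratified scheme needs enough labeled queries in every stratum, which would be prohibitively expensive with human annotation alone.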