Fixing Retrieval Issues in an AI Knowledge Base with BM25

The article describes how the author's AI pipeline, built on top of Ollama, failed to retrieve the correct information from its knowledge base of markdown documents. Retrieval based on cosine similarity could not match exact terms in the query, which led to incorrect responses. The author explains how BM25, a classic information retrieval algorithm, can run as a parallel retrieval path to address this.

Why it matters

This approach addresses a common issue in AI systems that rely on knowledge bases, where exact technical terms can be difficult to retrieve using only semantic-based methods.

Key Points

  • Cosine similarity-based retrieval failed to find exact technical terms such as model names and version strings
  • BM25 retrieves documents by exact term frequency, complementing the semantic retrieval done with cosine similarity
  • The solution is to run both cosine and BM25 retrieval in parallel and merge the results
  • BM25 is fast to rebuild from the source documents, with no separate index management required
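The BM25 scoring mentioned in the points above can be sketched in a few lines of Python. This is a minimal illustration of the standard Okapi BM25 formula, not the author's implementation: the `BM25Index` class, the `tokenize` helper, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: keeps dots, colons, hyphens and underscores
    # inside a token so a version string stays a single exact term.
    return re.findall(r"[a-z0-9]+(?:[.:\-_][a-z0-9]+)*", text.lower())


class BM25Index:
    """Tiny in-memory BM25 index, rebuilt from the documents on construction."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(d)) for d in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        # Fall back to 1.0 so an empty corpus does not divide by zero.
        self.avg_len = (sum(self.doc_lens) / len(docs)) if docs and sum(self.doc_lens) else 1.0
        # Document frequency: in how many documents each term appears.
        df = Counter()
        for tf in self.term_freqs:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf = self.term_freqs[i]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[i] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.term_freqs))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [(i, score) for score, i in ranked[:top_k]]


# Made-up sample documents with a made-up model name.
docs = [
    "Switched the summarizer to acme-7b-v0.3.1 after latency tests.",
    "Notes on prompt design and general retrieval quality.",
    "Rolled back from acme-7b-v0.3.0 because of tokenizer bugs.",
]
index = BM25Index(docs)
hits = index.search("which doc mentions acme-7b-v0.3.1")
```

Here only the first document contains the exact token `acme-7b-v0.3.1`, so it is the only hit. Because the index is just term counts derived from the raw text, rebuilding it amounts to constructing `BM25Index` again from the documents.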

Details

The author built an AI pipeline with a knowledge base of markdown documents, so that the model could answer questions about its own project history using the documents as ground truth. However, cosine similarity-based retrieval failed to find the correct information when the query contained exact technical terms such as model names. The author explains that these terms have no meaningful semantic neighborhood in the embedding space, so cosine similarity cannot match them reliably.

To address this, the author introduces BM25, which scores documents by exact term frequency, weighted by document length and corpus statistics. BM25 retrieves the documents that contain the exact technical terms, complementing the semantic retrieval provided by cosine similarity. The implementation runs both cosine and BM25 retrieval in parallel and merges the results, which makes knowledge base queries more robust.
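The summary does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the author's method. The function name, the constant `k=60`, and the file names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Merge several ranked lists of document ids into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents found by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


# Hypothetical outputs of the two retrieval paths, best match first.
cosine_hits = ["notes.md", "design.md", "changelog.md"]
bm25_hits = ["changelog.md", "models.md"]

merged = reciprocal_rank_fusion([cosine_hits, bm25_hits])
```

RRF uses only rank positions, so the cosine scores and BM25 scores never need to be put on a common scale. A document that only BM25 finds, such as one matched on an exact version string, still makes it into the merged list.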
