Fixing Retrieval Issues in an AI Knowledge Base with BM25
The article describes how the author's AI pipeline, built on top of Ollama, failed to retrieve the correct information from its knowledge base of markdown documents. Cosine-similarity retrieval could not match exact technical terms such as model names and version strings, leading to incorrect responses. The author explains how BM25, a classic information retrieval algorithm, can be run as a parallel retrieval path to address this issue.
Why it matters
This approach addresses a common issue in AI systems that rely on knowledge bases, where exact technical terms can be difficult to retrieve using only semantic-based methods.
Key Points
- Cosine similarity-based retrieval failed to find exact technical terms like model names and version strings
- The BM25 algorithm retrieves documents based on exact term frequency, complementing cosine's semantic retrieval
- The solution is to run both cosine and BM25 in parallel and merge the results
- BM25 is fast to rebuild from source documents, with no separate index management required
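The article doesn't include code, but the exact-term scoring BM25 performs can be sketched in pure Python. This is a minimal illustration, not the author's implementation; the function name, parameters, and sample documents are all hypothetical:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    Scores are driven by exact term frequency, weighted by inverse
    document frequency and normalized by document length -- which is
    why a literal token like a model name matches even when it has no
    meaningful embedding-space neighborhood.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency for each distinct query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent: contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative corpus: only documents containing the exact token score > 0
docs = [
    ["upgraded", "the", "pipeline", "to", "llama3.1"],
    ["general", "notes", "about", "embedding", "models"],
    ["llama3.1", "fixed", "the", "regression", "llama3.1"],
]
scores = bm25_scores(["llama3.1"], docs)
```

A document that never mentions the query token gets a score of exactly zero, while repeated exact mentions (adjusted for document length) rank highest.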
Details
The author built an AI pipeline with a knowledge base of markdown documents, aiming to let the model answer questions about its own project history using the documents as ground truth. However, cosine similarity-based retrieval failed when a query contained exact technical terms such as model names: these strings have no meaningful semantic neighborhood in the embedding space, so nearest-neighbor search over embeddings misses them.

To address this, the author introduces BM25, which scores documents by exact term frequency, weighted by document length and corpus statistics. BM25 retrieves the documents containing the exact technical terms, complementing the semantic retrieval of cosine similarity. The implementation runs both cosine and BM25 in parallel and merges the results, and because BM25 is rebuilt quickly from the source documents, no separate index needs to be managed.
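The article states that the cosine and BM25 results are merged but does not say how. Reciprocal rank fusion is one common way to combine two ranked lists without having to calibrate their incompatible score scales; this sketch is an assumption about the merge step, not the author's actual code:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    Each list contributes 1 / (k + rank) per document, so documents
    ranked well by either retriever surface near the top of the merge.
    The constant k dampens the influence of any single top rank.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from the two parallel retrievers
cosine_ranked = ["a", "b", "c"]   # semantic neighbors
bm25_ranked   = ["c", "a", "d"]   # exact-term matches
merged = rrf_merge([cosine_ranked, bm25_ranked])
```

A document found by both paths ("a" here) outranks documents found by only one, which matches the robustness goal: exact-term hits and semantic hits both make it into the final candidate set.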