Fixing Retrieval Issues in an AI Knowledge Base with BM25
The article describes how the author's AI pipeline, built on top of Ollama, failed to retrieve the correct information from its knowledge base of markdown documents. Cosine-similarity retrieval could not match exact technical terms such as model names and version strings, leading to incorrect responses. The author explains how BM25, a classic information retrieval algorithm, can be run as a parallel retrieval path to address this issue.
Why it matters
This approach addresses a common issue in AI systems that rely on knowledge bases, where exact technical terms can be difficult to retrieve using only semantic-based methods.
Key Points
- Cosine similarity-based retrieval failed to find exact technical terms like model names and version strings
- The BM25 algorithm retrieves documents based on exact term frequency, complementing cosine's semantic retrieval
- The solution is to run both cosine and BM25 in parallel and merge the results
- BM25 is fast to rebuild from source documents, with no separate index management required
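The article doesn't include code, but the exact-term scoring BM25 performs can be sketched in pure Python. This is a minimal illustration, not the author's implementation; the function name, parameters, and sample documents are all hypothetical:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    Scores are driven by exact term frequency, weighted by inverse
    document frequency and normalized by document length -- which is
    why a literal token like a model name matches even when it has no
    meaningful embedding-space neighborhood.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency for each distinct query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # term absent: contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative corpus: only documents containing the exact token score > 0
docs = [
    ["upgraded", "the", "pipeline", "to", "llama3.1"],
    ["general", "notes", "about", "embedding", "models"],
    ["llama3.1", "fixed", "the", "regression", "llama3.1"],
]
scores = bm25_scores(["llama3.1"], docs)
```

A document that never mentions the query token gets a score of exactly zero, while repeated exact mentions (adjusted for document length) rank highest.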
Details
The author built an AI pipeline with a knowledge base of markdown documents, aiming to let the model answer questions about its own project history using the documents as ground truth. However, cosine similarity-based retrieval failed when a query contained exact technical terms such as model names: these strings have no meaningful semantic neighborhood in the embedding space, so nearest-neighbor search over embeddings misses them.

To address this, the author introduces BM25, which scores documents by exact term frequency, weighted by document length and corpus statistics. BM25 retrieves the documents containing the exact technical terms, complementing the semantic retrieval of cosine similarity. The implementation runs both cosine and BM25 in parallel and merges the results, and because BM25 is rebuilt quickly from the source documents, no separate index needs to be managed.
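The article states that the cosine and BM25 results are merged but does not say how. Reciprocal rank fusion is one common way to combine two ranked lists without having to calibrate their incompatible score scales; this sketch is an assumption about the merge step, not the author's actual code:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    Each list contributes 1 / (k + rank) per document, so documents
    ranked well by either retriever surface near the top of the merge.
    The constant k dampens the influence of any single top rank.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from the two parallel retrievers
cosine_ranked = ["a", "b", "c"]   # semantic neighbors
bm25_ranked   = ["c", "a", "d"]   # exact-term matches
merged = rrf_merge([cosine_ranked, bm25_ranked])
```

A document found by both paths ("a" here) outranks documents found by only one, which matches the robustness goal: exact-term hits and semantic hits both make it into the final candidate set.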