Offline Evaluation of RAG-Grounded Answers in LaunchDarkly AI Configs
This tutorial shows how to run an offline LLM evaluation on a RAG-grounded support agent built using LaunchDarkly AI Configs, Datasets, and LLM-as-a-judge scoring.
Why it matters
This offline evaluation approach helps catch generation quality issues, detect regressions, and compare candidate prompts and models before committing to a new AI Config variation.
Key Points
- Structure a RAG-grounded test dataset by pre-computing retrieval offline and bundling the retrieved chunks into each row
- Pick the right LLM judge for the agent's output shape (Accuracy for natural-language answers, Likeness for structured labels)
- Avoid same-model bias by running the judge on a different model family than the agent
- Diagnose failing rows as dataset issues, agent issues, or judge calibration noise
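The first point can be sketched in a few lines. This is a minimal illustration, not the LaunchDarkly Dataset API: the `retrieved` mapping stands in for chunks you would pre-compute by running your production retriever offline, and the row field names (`input`, `context_chunks`, `expected_output`) are assumptions chosen for clarity.

```python
import json

# Hypothetical pre-computed retrieval results. In practice these would come
# from running the production retriever offline against each test question.
retrieved = {
    "How do I rotate an SDK key?": [
        "SDK keys can be rotated from Account settings > Projects.",
        "Rotating a key immediately invalidates the old key.",
    ],
}

def build_dataset_rows(cases):
    """Bundle each question with its pre-computed chunks and expected answer,
    so the evaluation sees the same grounded input the agent would."""
    rows = []
    for question, expected in cases:
        rows.append({
            "input": question,
            "context_chunks": retrieved.get(question, []),
            "expected_output": expected,
        })
    return rows

cases = [
    ("How do I rotate an SDK key?",
     "Rotate the key in Account settings; the old key stops working immediately."),
]
rows = build_dataset_rows(cases)
print(json.dumps(rows[0], indent=2))
```

Because the chunks are baked into each row, every evaluation run scores the model against identical grounded input, so score changes reflect the prompt or model, not retrieval drift.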
Details
The tutorial covers how to build a RAG-grounded test dataset, run it through the LaunchDarkly Playground with a cross-family judge, and diagnose failures as issues in the dataset, the agent, or the judge's calibration. By pre-computing the RAG retrieval offline and baking the chunks directly into each dataset row, the Playground can evaluate the model's reasoning over real grounded input. The tutorial also explains how to pick the right LLM judge based on the agent's output shape, and how to avoid same-model bias by using a judge from a different model family than the agent.