Bridging the Gap: Extracting Tests from LLM Traces
This article discusses the challenge of evaluating the performance of large language models (LLMs) in production, and presents a tool called 'export-trace' that converts Jaeger traces into test datasets for regression testing.
Why it matters
Bridging the gap between observability and evaluation is crucial for ensuring the continuous improvement of LLMs in production.
Key Points
- LLM teams often lack a feedback loop to evaluate model performance after deployment
- Traces in observability tools like Jaeger hold valuable data but are effectively write-only
- The 'export-trace' tool bridges the gap by converting Jaeger traces into YAML test datasets
- Generated test cases include the input, the expected output, and automatically generated assertions
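A generated test case might look like the following sketch. The field names and assertion format are assumptions for illustration, not the tool's documented schema:

```yaml
# Hypothetical 'export-trace' output -- field names are illustrative.
cases:
  - name: "chat-completion-7f3a"   # derived from the span's operation name
    input: "Summarize the attached support ticket in two sentences."
    expected: "The customer reports intermittent login failures after the last update."
    assertions:
      - type: contains             # auto-generated from the recorded response
        value: "login failures"
```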
Details
The article describes a common workflow on LLM teams: deploy a new model version, watch the dashboards for a few hours, then move on, without ever evaluating output quality. This happens because production traces are 'write-only': they are collected for observability but never reused for regression testing. The author presents a tool called 'export-trace' that closes this loop by converting Jaeger traces into YAML test datasets, turning each traced call into a separate test case. Each generated case includes the input, the expected output, and assertions derived automatically from the actual model responses. Teams can then run a new model version against real production data and check whether it has improved or regressed relative to the previous version.
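The conversion the article describes could be sketched as follows. The tag keys `llm.prompt` and `llm.response`, and the way assertions are derived, are assumptions for illustration; the actual 'export-trace' implementation is not shown in the article.

```python
import json

def trace_to_test_cases(trace_json: str) -> str:
    """Sketch: convert a Jaeger trace (JSON as returned by the Jaeger
    HTTP API) into a YAML test dataset, one test case per traced call.
    The tag keys 'llm.prompt' and 'llm.response' are assumptions --
    use whatever keys your instrumentation actually emits."""
    trace = json.loads(trace_json)
    cases = []
    for result in trace.get("data", []):
        for span in result.get("spans", []):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if "llm.prompt" not in tags or "llm.response" not in tags:
                continue  # skip spans that are not LLM calls
            response = tags["llm.response"]
            cases.append({
                "name": span["operationName"],
                "input": tags["llm.prompt"],
                "expected": response,
                # One simple auto-generated assertion: the new output
                # should contain the opening words of the recorded one.
                "assert_contains": " ".join(response.split()[:3]),
            })
    # Emit minimal YAML by hand (double-quoted JSON strings are valid
    # YAML scalars), avoiding a PyYAML dependency.
    lines = ["cases:"]
    for c in cases:
        lines.append(f"  - name: {json.dumps(c['name'])}")
        lines.append(f"    input: {json.dumps(c['input'])}")
        lines.append(f"    expected: {json.dumps(c['expected'])}")
        lines.append(f"    assert_contains: {json.dumps(c['assert_contains'])}")
    return "\n".join(lines)
```

Re-running a new model version then amounts to replaying each `input` and checking the `assert_contains` condition against the fresh output, rather than diffing against `expected` verbatim.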