Bridging the Gap: Extracting Tests from LLM Traces

This article discusses the challenge of evaluating the performance of large language models (LLMs) in production, and presents a tool called 'export-trace' that converts Jaeger traces into test datasets for regression testing.

💡 Why it matters

Bridging the gap between observability and evaluation is crucial for ensuring the continuous improvement of LLMs in production.

Key Points

  • LLM teams often lack a feedback loop to evaluate model performance after deployment
  • Traces in observability tools like Jaeger provide valuable data, but are effectively write-only
  • The 'export-trace' tool bridges the gap by converting Jaeger traces into YAML test datasets
  • Each generated test case includes the input, expected output, and automatically generated assertions
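
As a sketch of what a generated test case might look like, here is a hypothetical YAML entry; the field names and values below are illustrative assumptions, not the tool's documented schema:

```yaml
# Illustrative only: field names are assumptions, not 'export-trace' output.
- name: chat_completion_trace_0001
  input: "Summarize the incident report in two sentences."
  expected: "The outage began at 02:14 UTC and was resolved within 40 minutes."
  assertions:
    - type: contains
      value: "02:14 UTC"
    - type: max_length
      value: 400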

Details

The article highlights the common workflow of LLM teams: deploy a new model version, watch the dashboards for a few hours, and then move on without ever evaluating output quality. This happens because production traces are effectively write-only: they are collected for observability but never reused for regression testing. The author presents a tool called 'export-trace' that converts Jaeger traces into YAML test datasets, where each traced LLM call becomes a separate test case. The generated test cases include the input, the expected output, and automatically generated assertions derived from the actual model responses. This lets teams run a new model version against real production data and judge whether it has improved or regressed relative to the previous version.
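
A minimal sketch of the conversion step described above, assuming Jaeger-style spans that record the prompt and response as tags; the tag keys, span shape, and test-case schema here are hypothetical, not 'export-trace' internals:

```python
import json

# Simplified Jaeger-style span (structure abbreviated; the keys
# "llm.prompt" / "llm.response" are illustrative assumptions).
span = {
    "operationName": "llm.chat_completion",
    "tags": [
        {"key": "llm.prompt", "value": "Summarize: observability matters."},
        {"key": "llm.response", "value": "Observability matters because it closes the loop."},
        {"key": "llm.model", "value": "gpt-4o"},
    ],
}

def span_to_test_case(span):
    """Turn one span into a test-case dict: input, expected output,
    and loose assertions derived from the recorded response."""
    tags = {t["key"]: t["value"] for t in span["tags"]}
    response = tags["llm.response"]
    return {
        "name": span["operationName"],
        "input": tags["llm.prompt"],
        "expected": response,
        # Auto-generated assertions anchored to what the model
        # actually returned in production.
        "assertions": [
            {"type": "contains", "value": response.split()[0]},
            {"type": "max_length", "value": len(response) * 2},
        ],
    }

case = span_to_test_case(span)
print(json.dumps(case, indent=2))
```

Serializing `case` to YAML (e.g. with PyYAML) would yield an entry like the dataset format described in the article.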


AI Curator - Daily AI News Curation
