Debugging an LLM Bug at 3 AM: The Runbook I Wish I'd Had
This article provides a detailed runbook for quickly diagnosing and resolving issues with large language models (LLMs) in production. It outlines a step-by-step process to identify the root cause of an LLM incident, covering provider availability, model quality, self-inflicted issues, cost, and regulatory/reputational concerns.
Why it matters
LLM-backed features are becoming critical to many applications and services, yet production incidents involving them are notoriously hard to diagnose; a structured runbook turns that troubleshooting into a repeatable process.
Key Points
- Avoid immediately debugging the model itself; focus first on understanding the shape of the change
- Run three key commands to determine whether the issue lies with upstream providers, your own traffic, or a recent deployment
- Formulate a hypothesis and share it in the incident channel to coordinate the response
Details
The author shares a runbook they wished they'd had while writing a book on LLM observability. It covers a scenario in which an engineer is paged at 3 AM because the average LLM judge score has dropped, and lays out a structured diagnosis: check the status of upstream providers, analyze recent traffic patterns, and review any recent deployments or code changes. The article stresses not diving straight into the model itself, since debugging a distributed system demands a broader perspective. Instead, the focus is on understanding the shape of the change, which falls into one of five categories: provider availability, provider quality, self-inflicted quality, cost, or regulatory/reputational issues. By following the three-command triage, the engineer can quickly identify the likely root cause and formulate a hypothesis to share with the incident response team.
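The triage flow described above can be sketched in code. This is a minimal, hypothetical illustration, not the article's actual commands: the signal names, the check order, and the decision logic are all assumptions layered on the five categories the article names.

```python
def triage(provider_down: bool, provider_degraded: bool,
           recent_deploy: bool, cost_spike: bool,
           flagged_content: bool) -> str:
    """Map the results of the three triage checks (provider status,
    traffic patterns, recent deploys) to the most likely of the five
    incident categories. Check order here is an illustrative guess."""
    if provider_down:
        return "provider availability"
    if flagged_content:
        return "regulatory/reputational"
    if recent_deploy:
        return "self-inflicted quality"
    if cost_spike:
        return "cost"
    if provider_degraded:
        return "provider quality"
    return "unknown -- widen the search"

# Example: judge score dropped right after a deploy, providers healthy.
print(triage(provider_down=False, provider_degraded=False,
             recent_deploy=True, cost_spike=False,
             flagged_content=False))  # -> self-inflicted quality
```

In practice each boolean would come from a real check (a status-page probe, a traffic dashboard query, a deploy log), but the point of the sketch is the shape of the decision: classify first, then debug.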