Eval Debt Ledger: A Practical System for LLM Reliability Drift
TL;DR
If production failures aren’t translated into new evals, you’re borrowing reliability on credit. An eval debt ledger makes those gaps visible, prioritized, and reviewable each week.
Context
LLM teams often run postmortems but still repeat similar incidents: parser breaks, missing citations, policy edge cases, tool-calling failures. The root cause is often simple: there is no explicit mechanism for tracking what production has taught you that your eval suite still misses.
Key Points
1) Treat missed failures as debt, not a “nice-to-have”
Every class of incident that the evals missed should create a debt record with an owner and a due date.
2) Score debt by risk and recurrence
A useful priority score can combine:
- user impact,
- incident frequency,
- exploitability/safety risk,
- fix effort.
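One way to combine these four factors is a weighted risk sum divided by effort, so that cheap, high-risk items rise to the top. This is only a sketch: the 1-5 rating scale, the weights, the `DebtFactors` and `priority_score` names, and the example ratings are all assumptions made for illustration, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class DebtFactors:
    """Hypothetical 1-5 ratings for a single debt item."""
    user_impact: int   # 1 = cosmetic, 5 = blocks core workflows
    frequency: int     # 1 = seen once, 5 = recurs constantly
    safety_risk: int   # 1 = benign, 5 = exploitable or policy-violating
    fix_effort: int    # 1 = one new eval case, 5 = new harness required


def priority_score(f: DebtFactors) -> float:
    """Higher score means fix sooner. The weights are illustrative only."""
    risk = 4 * f.user_impact + 3 * f.frequency + 3 * f.safety_risk
    # Dividing by effort favors items that are both risky and cheap to cover.
    return risk / f.fix_effort


# Made-up ratings for two incident categories.
tool_call = DebtFactors(user_impact=5, frequency=4, safety_risk=2, fix_effort=2)
citation = DebtFactors(user_impact=3, frequency=2, safety_risk=2, fix_effort=1)
print(priority_score(tool_call), priority_score(citation))
```

Whether effort should divide the score or simply break ties is a team decision. What matters is that the formula is written down and applied the same way every week.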
3) Keep debt tied to concrete artifacts
Each ledger item should reference:
- incident ID,
- failing prompt/input/output sample,
- expected behavior,
- new eval case location,
- pass/fail status.
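The artifact list above could be enforced with a small record type that reports which references are still missing. This is a minimal sketch: the `LedgerItem` and `missing_artifacts` names, the field names, and the sample path are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LedgerItem:
    incident_id: str
    failing_sample: str                        # path or URL to the captured prompt/input/output
    expected_behavior: str
    eval_case_location: Optional[str] = None   # set once the eval case exists
    eval_passing: Optional[bool] = None        # None until the eval has been run


def missing_artifacts(item: LedgerItem) -> list[str]:
    """Return the names of fields that are still empty or unset."""
    return [f.name for f in fields(item) if getattr(item, f.name) in (None, "")]


item = LedgerItem(
    incident_id="INC-889",
    failing_sample="samples/inc-889.json",  # hypothetical path
    expected_behavior="Tool call arguments validate against the declared schema",
)
print(missing_artifacts(item))
```

An item with a non-empty result has not yet been tied to concrete evidence, so it should not be closed.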
4) Add “debt burn-down” to release readiness
Don't just ask “did metrics improve?” Also ask “did we retire the highest-risk eval debt before this release?”
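A release gate along these lines could be a short check over the ledger. The sketch below assumes rows shaped like the ledger columns, a terminal status named `closed`, and a policy that only `high` severity blocks a release. All three are assumptions to adapt.

```python
# Rows mirror the ledger columns; "closed" is an assumed terminal status.
LEDGER = [
    {"debt_id": "ED-104", "severity": "high", "status": "open"},
    {"debt_id": "ED-105", "severity": "medium", "status": "in_progress"},
]


def release_blockers(ledger, blocking_severities=("high",)):
    """Return IDs of unresolved debt items severe enough to block a release."""
    return [
        row["debt_id"]
        for row in ledger
        if row["status"] != "closed" and row["severity"] in blocking_severities
    ]


blockers = release_blockers(LEDGER)
if blockers:
    print("Release blocked by open eval debt:", ", ".join(blockers))
```

In a CI job, the same check could exit non-zero instead of printing, which makes the burn-down question part of the pipeline rather than a meeting agenda item.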
5) Make stale debt visible
If an item stays open past its SLA, escalate it in the weekly reliability review. Invisible debt is how regressions normalize.
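Stale items can be surfaced automatically by comparing each due date with the review date. This sketch treats the `due_date` column as the SLA deadline and `closed` as the terminal status, both of which are assumptions. The review date in the example is made up.

```python
from datetime import date

LEDGER = [
    {"debt_id": "ED-104", "due_date": "2026-04-12", "status": "open"},
    {"debt_id": "ED-105", "due_date": "2026-04-10", "status": "in_progress"},
]


def overdue_items(ledger, today):
    """Return (debt_id, days_overdue) pairs, most overdue first."""
    overdue = []
    for row in ledger:
        due = date.fromisoformat(row["due_date"])
        if row["status"] != "closed" and due < today:
            overdue.append((row["debt_id"], (today - due).days))
    return sorted(overdue, key=lambda pair: pair[1], reverse=True)


# Hypothetical review date, one day after the later deadline.
for debt_id, days in overdue_items(LEDGER, today=date(2026, 4, 13)):
    print(f"{debt_id} is {days} day(s) past due")
```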
Steps / Code
Simple eval debt ledger schema
debt_id,incident_id,category,severity,frequency,owner,due_date,status,eval_added,last_verified
ED-104,INC-889,tool_call_schema,high,recurring,ml-platform,2026-04-12,open,false,
ED-105,INC-892,citation_validity,medium,intermittent,quality-eng,2026-04-10,in_progress,true,2026-04-06
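The CSV above can be loaded with Python's standard `csv` module. The sketch below parses the two sample rows from an inline string and converts `eval_added` to a boolean and the date columns to `date` objects. The `load_ledger` name and the choice to represent an empty `last_verified` as `None` are assumptions.

```python
import csv
import io
from datetime import date

LEDGER_CSV = """\
debt_id,incident_id,category,severity,frequency,owner,due_date,status,eval_added,last_verified
ED-104,INC-889,tool_call_schema,high,recurring,ml-platform,2026-04-12,open,false,
ED-105,INC-892,citation_validity,medium,intermittent,quality-eng,2026-04-10,in_progress,true,2026-04-06
"""


def load_ledger(text):
    """Parse ledger CSV text into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["eval_added"] = row["eval_added"] == "true"
        row["due_date"] = date.fromisoformat(row["due_date"])
        # An empty last_verified means the item has never been verified.
        verified = row["last_verified"]
        row["last_verified"] = date.fromisoformat(verified) if verified else None
        rows.append(row)
    return rows


ledger = load_ledger(LEDGER_CSV)
print([row["debt_id"] for row in ledger if not row["eval_added"]])
```

For a ledger kept in a file, the same function works on the file's contents, for example the result of `pathlib.Path(...).read_text()`.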
Weekly review checklist
1) Sort open debt by risk score.
2) Close items whose new eval case is merged and passing.
3) Re-run related regression suite.
4) Escalate overdue high-severity debt.
5) Report burn-down trend week over week.
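The burn-down trend in step 5 comes down to the change in open item count from one review to the next. Here is a minimal sketch. The `burn_down` name, the week labels, and the counts are made up for illustration.

```python
def burn_down(open_counts):
    """Given chronological (week_label, open_item_count) pairs,
    return (week_label, change_from_previous_week) pairs."""
    trend = []
    for (_, prev), (week, curr) in zip(open_counts, open_counts[1:]):
        trend.append((week, curr - prev))
    return trend


# Hypothetical snapshots taken at three consecutive weekly reviews.
snapshots = [("W14", 12), ("W15", 9), ("W16", 10)]
for week, delta in burn_down(snapshots):
    direction = "down" if delta < 0 else "up" if delta > 0 else "flat"
    print(f"{week}: {delta:+d} ({direction})")
```

A positive delta means debt was opened faster than it was retired that week, which is worth raising in the review even if no single item is overdue.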
Trade-offs
- Pro: Converts postmortem insights into durable safeguards.
- Pro: Improves release confidence over time.
- Con: Requires discipline to maintain metadata.
- Con: Can become bureaucratic if categories are too granular.
References
- OpenAI — Evaluating model performance: https://platform.openai.com/docs/guides/evals
- Google SRE — Postmortem culture: https://sre.google/sre-book/postmortem-culture/
- Anthropic — Building effective evals (engineering guidance): https://docs.anthropic.com/en/docs/build-with-claude/evals
Final Take
Incidents are inevitable. Repeated incidents are optional. An eval debt ledger is how you keep the first from becoming the second.
Changelog
- 2026-04-06: Initial draft created and published.