Eval Debt Ledger: A Practical System for LLM Reliability Drift

Apr 6, 2026

TL;DR

If production failures aren’t translated into new evals, you’re borrowing reliability on credit. An eval debt ledger makes those gaps visible, prioritized, and reviewable each week.

Context

LLM teams often run postmortems but still repeat similar incidents: parser breaks, missing citations, policy edge cases, tool-calling failures. The root cause is often simple: there is no explicit mechanism for tracking what production has taught you that your eval suite still misses.

Key Points

1) Treat missed failures as debt, not “nice-to-have”

Every incident class that your evals missed should create a debt record with an owner and a due date.

2) Score debt by risk and recurrence

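A useful priority score can combine severity with how often the failure recurs, the two signals the ledger records in its severity and frequency columns.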
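One way to make such a score concrete is a small lookup-and-multiply function. This is a minimal sketch, not a recommended formula: the numeric weights, the multiplication, and the labels `low` and `one_off` are assumptions made for illustration. Only `high`, `medium`, `recurring`, and `intermittent` appear in the sample ledger.

```python
# Hypothetical weights: the ledger records severity and frequency labels,
# but the numbers assigned to them here are illustrative assumptions.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3}
FREQUENCY_WEIGHT = {"one_off": 1, "intermittent": 2, "recurring": 3}


def priority_score(severity: str, frequency: str) -> int:
    """Multiply the two weights so high-severity, recurring debt ranks first."""
    return SEVERITY_WEIGHT[severity] * FREQUENCY_WEIGHT[frequency]


if __name__ == "__main__":
    # The two rows from the sample ledger.
    print("ED-104", priority_score("high", "recurring"))
    print("ED-105", priority_score("medium", "intermittent"))
```

Multiplying rather than adding keeps a low-severity but constantly recurring issue from outranking a high-severity one that recurs just as often. Teams that weigh the two signals differently can adjust the tables without touching the function.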

3) Keep debt tied to concrete artifacts

Each ledger item should reference:

4) Add “debt burn-down” to release readiness

Do not ask only “did metrics improve?” Also ask “did we retire the highest-risk eval debt before this release?”
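The release question can be turned into a mechanical check. The sketch below assumes that "highest-risk" means `high` severity and that a retired item carries a `closed` status. Neither is stated in the ledger sample, which only shows `open` and `in_progress`, and `DebtItem`, `release_blockers`, and `release_ready` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class DebtItem:
    debt_id: str
    severity: str
    status: str  # "closed" is an assumed terminal status


def release_blockers(
    items: Iterable[DebtItem],
    blocking_severities: Tuple[str, ...] = ("high",),
) -> List[DebtItem]:
    """Return items that are not closed and sit at a blocking severity."""
    return [
        item
        for item in items
        if item.status != "closed" and item.severity in blocking_severities
    ]


def release_ready(items: Iterable[DebtItem]) -> bool:
    """A release is ready only when no blocking debt remains."""
    return not release_blockers(items)


if __name__ == "__main__":
    ledger = [
        DebtItem("ED-104", "high", "open"),
        DebtItem("ED-105", "medium", "in_progress"),
    ]
    for item in release_blockers(ledger):
        print("blocks release:", item.debt_id)
```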

5) Make stale debt visible

If an item stays open past its SLA, escalate it in the weekly reliability review. Invisible debt is how regressions normalize.
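A stale-debt check can be as small as a date comparison. This sketch treats the ledger's `due_date` as the SLA deadline and `closed` as the terminal status; both are assumptions, and `overdue_items` is a hypothetical helper.

```python
from datetime import date
from typing import Dict, List, Tuple


def overdue_items(rows: List[Dict[str, str]], today: date) -> List[Tuple[str, int]]:
    """Return (debt_id, days_overdue) for unclosed items past their due date,
    most overdue first."""
    overdue = []
    for row in rows:
        due = date.fromisoformat(row["due_date"])
        if row["status"] != "closed" and due < today:
            overdue.append((row["debt_id"], (today - due).days))
    return sorted(overdue, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    rows = [
        {"debt_id": "ED-104", "due_date": "2026-04-12", "status": "open"},
        {"debt_id": "ED-105", "due_date": "2026-04-10", "status": "in_progress"},
    ]
    # The review date is passed in explicitly so the check is reproducible.
    print(overdue_items(rows, today=date(2026, 4, 14)))
```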

Steps / Code

Simple eval debt ledger schema

debt_id,incident_id,category,severity,frequency,owner,due_date,status,eval_added,last_verified
ED-104,INC-889,tool_call_schema,high,recurring,ml-platform,2026-04-12,open,false,
ED-105,INC-892,citation_validity,medium,intermittent,quality-eng,2026-04-10,in_progress,true,2026-04-06
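To work with the ledger programmatically, the CSV above can be loaded into typed records. The following is a minimal sketch using only the Python standard library. `DebtRecord` and `load_ledger` are hypothetical names, and the parsing rules (`true`/`false` strings, an empty `last_verified` meaning "never verified") are assumptions inferred from the two sample rows.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

LEDGER_CSV = """\
debt_id,incident_id,category,severity,frequency,owner,due_date,status,eval_added,last_verified
ED-104,INC-889,tool_call_schema,high,recurring,ml-platform,2026-04-12,open,false,
ED-105,INC-892,citation_validity,medium,intermittent,quality-eng,2026-04-10,in_progress,true,2026-04-06
"""


@dataclass
class DebtRecord:
    debt_id: str
    incident_id: str
    category: str
    severity: str
    frequency: str
    owner: str
    due_date: date
    status: str
    eval_added: bool
    last_verified: Optional[date]  # None when the column is empty


def load_ledger(text: str) -> List[DebtRecord]:
    """Parse ledger CSV text into DebtRecord objects."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        verified = row["last_verified"]
        records.append(
            DebtRecord(
                debt_id=row["debt_id"],
                incident_id=row["incident_id"],
                category=row["category"],
                severity=row["severity"],
                frequency=row["frequency"],
                owner=row["owner"],
                due_date=date.fromisoformat(row["due_date"]),
                status=row["status"],
                eval_added=row["eval_added"] == "true",
                last_verified=date.fromisoformat(verified) if verified else None,
            )
        )
    return records


if __name__ == "__main__":
    for record in load_ledger(LEDGER_CSV):
        print(record.debt_id, record.owner, record.due_date, record.last_verified)
```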

Weekly review checklist

1) Sort open debt by risk score.
2) Close items with merged eval coverage.
3) Re-run related regression suite.
4) Escalate overdue high-severity debt.
5) Report burn-down trend week over week.
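Steps 2 and 5 of the checklist lend themselves to small helpers. The sketch below assumes that `eval_added` being true means the eval coverage has merged and that `closed` is the terminal status. The weekly open counts in the usage example are made-up illustration values, not measurements, and `closable` and `burn_down_trend` are hypothetical names.

```python
from typing import Dict, List


def closable(rows: List[Dict[str, object]]) -> List[str]:
    """Step 2: IDs of items that have eval coverage but are not yet closed."""
    return [
        str(row["debt_id"])
        for row in rows
        if row["eval_added"] and row["status"] != "closed"
    ]


def burn_down_trend(weekly_open_counts: List[int]) -> List[int]:
    """Step 5: week-over-week change in open debt (oldest week first).
    Negative values mean debt was retired that week."""
    return [
        current - previous
        for previous, current in zip(weekly_open_counts, weekly_open_counts[1:])
    ]


if __name__ == "__main__":
    rows = [
        {"debt_id": "ED-104", "status": "open", "eval_added": False},
        {"debt_id": "ED-105", "status": "in_progress", "eval_added": True},
    ]
    print("ready to close:", closable(rows))
    # Illustrative counts only; real values would come from ledger snapshots.
    print("weekly change:", burn_down_trend([12, 10, 11, 7]))
```

A positive entry in the trend is the signal worth raising in review: it means new debt arrived faster than old debt was retired that week.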

Trade-offs

Final Take

Incidents are inevitable. Repeated incidents are optional. An eval debt ledger is how you keep the first from becoming the second.
