Rollback Budget: The Missing Guardrail in LLM Rollouts
TL;DR
Most teams define launch criteria but not rollback criteria. A rollback budget (how much degradation you tolerate, for how long) makes incident response faster and protects users from slow-motion reliability drift.
Context
In many LLM deployments, shipping is disciplined but rollback is improvised. Teams have canary gates, dashboards, and scorecards—yet when things degrade after release, they debate too long because nobody agreed on the “stop loss” rule.
A rollback budget closes that gap.
Key Points
1) A rollback budget is a stop-loss, not a punishment
It is a pre-agreed reliability envelope that says: if service quality falls beyond X for Y minutes, rollback is automatic unless an incident commander explicitly overrides.
2) Use user-impact metrics first
Budget triggers should be tied to outcomes users feel:
- task failure increase,
- escalation/retry rate,
- severe safety errors,
- timeout/latency regressions.
Internal metrics still matter, but they should not be the only triggers.
3) Define both magnitude and duration
Avoid noisy reversals by requiring both:
- magnitude threshold (e.g., success rate down >2%), and
- duration threshold (e.g., sustained for 20 minutes).
This avoids rolling back on random blips while still stopping persistent harm.
4) Separate auto-rollback from manual review
Not every breach needs instant rollback. Use tiers:
- Tier A: severe safety/compliance regressions → immediate rollback.
- Tier B: quality/cost/performance drift → short manual decision window.
5) Log every budget breach as eval debt
If a breach happened in production but wasn’t caught pre-release, create new eval cases immediately. Otherwise incidents repeat.
Steps / Code
Minimal rollback budget spec
release: model-v42
baseline: model-v41
window: 24h
rollback_budget:
quality:
task_success_delta_pct: -2.0
max_duration_min: 20
safety:
high_severity_incidents_increase: 0
action: immediate_rollback
latency:
p95_delta_pct: +15
max_duration_min: 30
escalation:
human_handoff_delta_pct: +8
max_duration_min: 20
decision:
tier_a_breach: auto_rollback
tier_b_breach: incident_commander_decision_within_15m
Operating rule
- Set budget before rollout.
- Attach owner/on-call.
- Enforce trigger automatically where possible.
- Convert breaches into new eval coverage.
Trade-offs
- Pro: Reduces delayed rollback caused by debate.
- Pro: Protects user experience during uncertain launches.
- Con: Poorly set thresholds can cause unnecessary reversions.
- Con: Needs clean baseline instrumentation to work well.
References
- Google SRE Workbook — Canarying releases: https://sre.google/workbook/canarying-releases/
- OpenAI — Evals design guide: https://platform.openai.com/docs/guides/evals-design
- NIST — AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
Final Take
A launch gate tells you when to ship. A rollback budget tells you when to stop. Mature LLM operations require both.
Changelog
- 2026-04-05: Initial draft created and published.