From Uptime to Outcomes: SLOs That Actually Work for LLM Apps
TL;DR
If your only SLOs are latency and uptime, your LLM app can be “green” while users still get bad answers. Keep classic infra SLOs, but add outcome SLOs for task success, grounded responses, and safe fallback behavior. Then gate releases on those metrics.
Context
Most production teams already track standard SRE indicators: availability, latency, and error rate. That baseline is still necessary. But LLM systems fail in ways that these indicators miss: plausible nonsense, poor instruction-following, and brittle behavior on edge prompts.
Google’s SRE framework is explicit: an SLI should measure what users actually experience, and sometimes server-side metrics are only a proxy. For LLM apps, that warning is the whole point. “200 OK in 500 ms” is not equivalent to “user got a useful, correct answer.”
At the same time, guidance from NIST AI RMF and current model-eval playbooks points toward continuous evaluation and risk-aware operation, not one-time testing. So the practical move is to define reliability targets that include user-facing quality, not just platform health.
Key Points
1) Split reliability into two layers
Use two explicit SLO layers:
- Platform SLOs: availability, latency, infrastructure errors.
- Outcome SLOs: answer usefulness, factual grounding/citation quality, safe refusal/escalation on risky or uncertain requests.
You need both. Platform SLOs protect system responsiveness; outcome SLOs protect user trust.
2) Pick outcome SLIs you can measure weekly
Good outcome SLIs are:
- observable,
- stable enough to trend,
- tied to a real user outcome.
A minimal set for many assistant products:
- Task success rate on a fixed eval set.
- Groundedness pass rate (claims supported by provided context or references).
- Unsafe failure rate (high-severity policy or risk violations).
- Escalation quality (when uncertain, model abstains or routes to human/process instead of bluffing).
3) Define a reliability budget for quality regressions
SRE teams use error budgets; LLM teams should mirror that concept for outcome quality.
Example:
- Monthly task-success SLO: >= 92% on core eval set.
- Budget: 8 percentage points of allowed miss.
- If burn rate exceeds threshold, freeze risky launches and prioritize reliability fixes.
This turns “it feels worse lately” into operational policy.
4) Gate launches with control-vs-candidate comparisons
Before promoting model/prompt/retrieval changes:
- Run offline evals against baseline.
- Canary in production.
- Compare candidate vs control on both platform and outcome SLIs.
- Promote only if outcome deltas pass predefined thresholds.
This aligns with current eval guidance: describe desired behavior, test with representative data, analyze results, iterate.
5) Treat uncertainty behavior as a first-class reliability signal
A high-performing assistant should sometimes say “I don’t know.” If your system never abstains, it is likely overconfident.
Operationalize this by tracking:
- abstention rate on unanswerable items,
- hallucination rate on fact-check subsets,
- citation/quote support coverage for factual claims.
Steps / Code
Starter outcome SLO spec (example)
service: support-assistant
window: 28d
slo_layers:
platform:
availability: ">=99.9%"
p95_latency_ms: "<=1800"
outcome:
task_success_core_eval: ">=0.92"
grounded_claim_rate: ">=0.90"
high_severity_safety_violations: "<=0.002"
correct_abstention_on_unanswerable: ">=0.85"
release_policy:
require_all_platform_slos: true
require_all_outcome_slos: true
block_on_regression:
task_success_core_eval_delta: "< -0.01"
grounded_claim_rate_delta: "< -0.02"
high_severity_safety_violations_delta: "> 0.000"
Weekly operating loop
1) Refresh eval dataset slices (core, edge, high-risk).
2) Run baseline vs candidate evals.
3) Review budget burn for each outcome SLO.
4) Decide: promote, hold, or rollback.
5) Log decision + evidence in release journal.
Trade-offs
- Pro: Catches “looks healthy but feels broken” regressions.
- Pro: Aligns engineering decisions with user trust outcomes.
- Pro: Makes release governance auditable.
- Con: Outcome labeling/eval maintenance adds ongoing cost.
- Con: Some SLIs (like helpfulness) need careful rubric design.
- Con: Overly strict gates can slow iteration if metrics are noisy.
References
- Google SRE Book — Service Level Objectives: https://sre.google/sre-book/service-level-objectives/
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- OpenAI Docs — Working with evals: https://platform.openai.com/docs/guides/evals
- Anthropic Docs — Reduce hallucinations: https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/reduce-hallucinations
Final Take
For LLM systems, uptime is table stakes, not reliability. Reliability means users consistently get useful and trustworthy outcomes. The teams that operationalize outcome SLOs now will ship faster and break trust less often.
Changelog
- 2026-04-13: Initial draft created and published.