Mert Tosun
SLO, SLI, and Error Budget: Operating Service Reliability

Mert Tosun · DevOps

Reliability gets ambiguous when teams use different definitions of "healthy." SRE resolves this by turning reliability into measurable agreements: SLIs to observe behavior, SLOs to define targets, and error budgets to guide delivery speed versus stability.

Core definitions

  • SLI (service level indicator): what you measure, for example success rate or latency
  • SLO (service level objective): the target an SLI must meet over a time window
  • SLA (service level agreement): a customer-facing contractual commitment

Example:

  • SLI: successful request ratio
  • SLO: 99.9% over 30 days
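
As a rough illustration of the example above, the sketch below computes a success-ratio SLI from request counts and compares it to the 99.9% target. The function name, the request counts, and the choice to treat a window with no traffic as compliant are assumptions made for this sketch, not something a particular monitoring system prescribes.

```python
def success_ratio(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded within the window."""
    if total == 0:
        # Assumption: a window with no traffic counts as fully successful.
        return 1.0
    return successful / total


SLO_TARGET = 0.999  # 99.9% over the 30-day window

# Hypothetical request counts for one 30-day window.
sli = success_ratio(successful=9_993_000, total=10_000_000)
print(f"SLI = {sli:.4%}, meets SLO: {sli >= SLO_TARGET}")
```

In a real setup the two counts would come from your metrics backend; only the ratio and the comparison matter here.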

Error budget in practice

30-day window = 30 × 24 × 60 = 43,200 minutes
SLO 99.9% => error budget = 100% - 99.9% = 0.1%
Allowed user-impacting downtime = 43,200 × 0.001 = 43.2 minutes per 30 days
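
The same arithmetic can be written as a small helper, which makes it easy to see how quickly the budget shrinks as the target gains another nine. This is a minimal sketch; `error_budget_minutes` is a name chosen for illustration, and it assumes the budget is expressed as full-outage minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return window_minutes * (1 - slo)


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.2f} minutes per 30 days")
```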

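Burn rate is how fast the budget is being consumed: a burn rate of 1 spends the budget exactly by the end of the window, and anything higher exhausts it early. If your burn rate is too high, feature rollout should slow down until reliability returns to safe levels.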
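
One common way to express burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below uses that definition and estimates how long a full budget would last at a given rate; the function names and the 0.5% error ratio are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio relative to the ratio the SLO allows."""
    allowed = 1 - slo
    return error_ratio / allowed


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days * 24 / rate


# Hypothetical: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo=0.999)
print(f"burn rate ~ {rate:.1f}, budget lasts ~ {hours_to_exhaustion(rate):.0f} hours")
```

At roughly five times the sustainable rate, a 30-day budget would be gone in about six days, which is the kind of signal that should slow a rollout.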

Choose user-centric indicators

Good indicators:

  • API success ratio
  • p95 latency on critical endpoints
  • checkout completion rate

Weak indicators:

  • average CPU utilization on its own
  • pod count with no link to user impact

Metrics should represent actual user experience, not only infrastructure internals.
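
To make a latency indicator concrete, the sketch below computes a p95 with the nearest-rank method and the share of requests served under a threshold. The sample latencies, the 300 ms threshold, and both function names are assumptions for illustration; production systems usually derive these values from histograms rather than raw samples.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[rank - 1]


def fast_request_ratio(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Latency SLI: fraction of requests served within the threshold."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


# Hypothetical latencies (ms) for one critical endpoint.
samples = [120, 95, 110, 480, 130, 105, 90, 101, 99, 2500]
print("p95 latency (ms):", percentile_nearest_rank(samples, 95))
print("share under 300 ms:", fast_request_ratio(samples, 300))
```

Both numbers describe what users actually waited for, which is something a CPU graph cannot show.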

Alerting model

Use multi-window, multi-burn-rate alerts:

Short window + high burn rate -> page now
Long window + medium burn rate -> warning

The short-window rule catches fast incidents early, while the long-window rule catches slow budget drain without paging on every brief blip.
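
The two rules above can be sketched as a small evaluation function. The window lengths (1 hour and 6 hours) and the thresholds (14.4 and 6) are illustrative assumptions, not values taken from this post; tune them to your own SLO and traffic. `BurnRule` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRule:
    name: str
    window_hours: int
    threshold: float  # burn rate at or above this value triggers the action
    action: str


# Ordered from most to least urgent. Values are assumptions for illustration.
RULES = (
    BurnRule("fast burn", window_hours=1, threshold=14.4, action="page"),
    BurnRule("slow burn", window_hours=6, threshold=6.0, action="warning"),
)


def evaluate(burn_by_window: Mapping[int, float]) -> str:
    """Return the most urgent action whose rule fires, or 'ok' if none does."""
    for rule in RULES:
        observed = burn_by_window.get(rule.window_hours, 0.0)
        if observed >= rule.threshold:
            return rule.action
    return "ok"


print(evaluate({1: 20.0, 6: 7.0}))  # fast burn rule fires first
print(evaluate({1: 3.0, 6: 7.0}))   # only the slow burn rule fires
```

The input maps each window length to the burn rate measured over that window, so the function stays independent of where the measurements come from.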

Process integration

Reliability policy should be explicit:

  1. Budget healthy -> normal release speed.
  2. Budget draining fast -> smaller rollout steps.
  3. Budget exhausted -> prioritize reliability work.
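
A policy like this is easier to apply consistently when it is written down as code. The following sketch maps the three states to a release mode; the `fast_burn` threshold of 2.0 and all names are assumptions chosen for illustration.

```python
from enum import Enum


class ReleaseMode(Enum):
    NORMAL = "normal release speed"
    SMALLER_STEPS = "smaller rollout steps"
    RELIABILITY_FIRST = "prioritize reliability work"


def release_mode(
    budget_remaining: float,  # fraction of the error budget left, e.g. 0.4
    burn_rate: float,         # current burn rate, 1.0 = sustainable
    fast_burn: float = 2.0,   # assumed threshold for "draining fast"
) -> ReleaseMode:
    """Map error budget state to the release policy described above."""
    if budget_remaining <= 0:
        return ReleaseMode.RELIABILITY_FIRST
    if burn_rate >= fast_burn:
        return ReleaseMode.SMALLER_STEPS
    return ReleaseMode.NORMAL


print(release_mode(budget_remaining=0.6, burn_rate=0.8).value)
print(release_mode(budget_remaining=0.3, burn_rate=4.0).value)
print(release_mode(budget_remaining=0.0, burn_rate=1.0).value)
```

Checking exhaustion before burn rate means an empty budget always wins, even if the current burn rate looks calm.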

Conclusion

SLO/SLI discipline gives teams a shared language for reliability and removes guesswork from release decisions. Error budgets are not just reporting numbers; they are operational guardrails that help teams move fast without losing production stability.