Quick Answer
Understanding the difference between an SLA, an SLO, and an SLI is one of the most important things a DevOps or SRE team can get right. An SLI (Service Level Indicator) is the raw metric: latency, error rate, availability. An SLO (Service Level Objective) is the internal target you set for that metric, for example 99.9% availability over a rolling 30-day window. An SLA (Service Level Agreement) is the external contract that defines the consequences if you miss a committed service level. Most teams confuse the three because they sound similar, but they operate at different layers of your reliability stack, and conflating them leads to objectives nobody can act on and agreements nobody can meet.
Key Takeaways
- SLIs measure what’s happening. SLOs define what should happen. SLAs define what happens if it doesn’t.
- Your SLO should always be stricter than your SLA — if you’re hitting your SLO consistently, you’re protecting your SLA automatically.
- Error budgets are derived from SLOs, not SLAs. You can’t build a meaningful error budget from a contractual obligation.
- Many SLA violations trace back not to the underlying system failing outright, but to SLOs that were never properly defined or tracked.
- Google’s SRE book popularized SLIs and SLOs. The framework is tool-agnostic — it works whether you’re running on-prem, cloud, or hybrid.
Why These Three Terms Get Confused — and Why It Costs You
SLA, SLO, and SLI appear in the same conversations, in the same documents, and often get used as if they’re interchangeable. They’re not, and the SLA vs SLO confusion in particular has real operational consequences.
A team that tracks SLIs without SLOs has plenty of data but no signal — they can see that latency is 240ms, but they don’t know whether that’s acceptable. A team that defines SLAs without SLOs has commitments without a mechanism to protect them — they’ve promised customers 99.9% uptime but have no internal target that would tell them when they’re trending toward a violation. And a team that sets SLOs without grounding them in measured SLIs is writing aspirational fiction, not operational targets.
The three work as a hierarchy. SLIs are the inputs. SLOs are the thresholds applied to those inputs. SLAs are the consequences attached to those thresholds. Each layer depends on the one below it being well-defined.
What Is an SLI (Service Level Indicator)?
A Service Level Indicator is a quantitative measurement of a specific aspect of a service’s behavior. It’s a number — a ratio, a rate, a latency percentile — that tells you something concrete about how the service is performing right now.
The most common SLIs fall into four categories:
Availability. The proportion of time the service responds successfully to requests. Typically expressed as a ratio: successful requests divided by total requests over a measurement window. Not uptime in the ICMP-ping sense — availability as an SLI measures whether the service is responding correctly, not just whether the machine is alive.
Latency. How long it takes the service to respond to a request. Almost always tracked at percentiles rather than averages — P50, P95, P99. Averages mask the tail latency that users actually experience during degradation events. A service with a 50ms average latency and a 2,000ms P99 latency is a service with a serious problem that the average conceals.
Error rate. The proportion of requests that result in an error response. Expressed as a percentage of total requests over a time window. The definition of “error” matters: a 500 HTTP response is clearly an error; a 200 response with an empty body might be an error depending on the service contract.
Throughput. The volume of requests the service handles per unit of time. Useful as an SLI for services where capacity constraints are the primary failure mode — queues, batch processors, data pipelines.
A well-defined SLI has four properties: it’s quantifiable, it’s measurable in near-real-time, it reflects something users actually experience, and it can be computed from data your system already produces. An SLI that requires a manual calculation process or a weekly report is not an operational SLI.
[Figure: The four SLI categories. Choose the ones that reflect what users actually experience.]
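To make the four categories concrete, here is a minimal sketch of how each SLI could be computed from raw request records. The `Request` record, its field names, and the choice to count only 5xx responses as errors are assumptions for illustration; your own service contract may define "error" differently.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    """One request record. Field names are hypothetical."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond, in milliseconds


def availability(requests):
    """Successful requests divided by total requests."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def error_rate(requests):
    """Proportion of requests that returned a server error (assumed: 5xx)."""
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile, with pct between 1 and 100."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(requests, window_seconds):
    """Requests handled per second over the measurement window."""
    return len(requests) / window_seconds


# 97 fast successes, 2 slow successes, 1 server error
sample = ([Request(200, 50.0)] * 97
          + [Request(200, 2000.0)] * 2
          + [Request(500, 30.0)])

print(availability(sample))                             # 0.99
print(error_rate(sample))                               # 0.01
print(latency_percentile(sample, 50))                   # 50.0
print(latency_percentile(sample, 99))                   # 2000.0
print(statistics.mean(r.latency_ms for r in sample))    # 88.8
print(throughput(sample, 60))
```

In this sample the mean latency is under 90ms while the P99 is 2,000ms, which is the kind of tail problem an average hides.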
What Is an SLO (Service Level Objective)?
A Service Level Objective is an internal target for an SLI over a defined measurement window. It’s the answer to the question: what level of performance is acceptable for this service?
An SLO has three components: the SLI it applies to, the target (a threshold plus the proportion of events that must meet it), and the measurement window.
Example: “99% of checkout API requests will complete in under 400ms, measured over a rolling 28-day window.”
That’s a complete SLO. It names the SLI (checkout API request latency), the threshold (400ms), the compliance target (99% of requests), and the window (28 days). Remove any of those elements and you have an aspiration, not an objective.
The measurement window matters more than most teams realize. A calendar-month window creates a hard reset on the first of every month: you start fresh whether you burned your entire error budget in the first week or cruised through with no incidents. A rolling window reflects continuous operational health more accurately. Many SRE teams use rolling 28- or 30-day windows for operational SLOs and calendar months for customer-facing reporting.
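One way to keep every element of an SLO explicit is to write it down as data and evaluate it mechanically. The sketch below assumes a latency SLO like the checkout example above; the `LatencySLO` class, its field names, and the sample latencies are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """The elements of a latency SLO. Names are illustrative."""
    sli: str             # which indicator the objective applies to
    threshold_ms: float  # a request is "good" if it finishes faster than this
    target: float        # proportion of requests that must be good
    window_days: int     # length of the rolling measurement window


def compliance(latencies_ms, slo):
    """Fraction of requests in the window that beat the threshold."""
    good = sum(1 for ms in latencies_ms if ms < slo.threshold_ms)
    return good / len(latencies_ms)


def is_met(latencies_ms, slo):
    """True when measured compliance reaches the SLO target."""
    return compliance(latencies_ms, slo) >= slo.target


checkout = LatencySLO(
    sli="checkout API request latency",
    threshold_ms=400.0,
    target=0.99,
    window_days=28,
)

# Hypothetical latencies observed during the last 28 days
window = [120.0] * 995 + [650.0] * 5

print(compliance(window, checkout))  # 0.995
print(is_met(window, checkout))      # True
```

If any field of `LatencySLO` is missing, `is_met` cannot be computed, which is a practical test of whether an SLO is complete.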
SLO Tightness: The Trade-off Nobody Talks About
Setting an SLO at 99.99% availability sounds responsible. It may not be. An SLO that’s too tight relative to your actual system capability will burn your error budget constantly, trigger alerts that your team learns to ignore, and create a culture of gaming the metric rather than improving the system.
The right SLO is set slightly above your system’s demonstrated performance over the last 90 days — tight enough to be meaningful, loose enough to leave room for planned maintenance and minor incidents without immediately exhausting the budget. Google’s SRE book recommends starting with an SLO that reflects actual historical performance before tightening it as reliability improves. The objective is to create a signal, not a constant alarm.
What Is an Error Budget?
An error budget is the amount of unreliability your SLO permits. If your SLO for availability is 99.9% over 30 days, your error budget is 0.1% of that window, or about 43 minutes of downtime. You can spend that budget on planned maintenance or risky deployments, or use it to absorb unexpected incidents. When the budget is exhausted, the correct response is to freeze non-essential deployments and focus engineering effort on reliability improvements.
Error budgets shift the conversation from “who caused the outage” to “how much of our budget did this consume.” That’s a more productive framing for engineering teams because it connects reliability work directly to remaining operational capacity.
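The arithmetic behind an error budget is short enough to sketch directly. The function names and the incident durations below are assumptions for illustration; the 99.9% target and 30-day window come from the example above.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of downtime the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - sum(downtime_minutes) / budget


budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # 43.2

# Hypothetical incidents in the current window, in minutes of downtime
incidents = [12.0, 9.6]
print(round(budget_remaining(0.999, 30, incidents), 2))  # 0.5
```

Framed this way, an incident review can state its cost as a share of the budget: the two incidents above consumed half of it.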
What Is an SLA (Service Level Agreement)?
A Service Level Agreement is a contract — external, formal, and consequential. It defines the service level a provider commits to delivering and what happens if that commitment is not met. The “what happens” part is what distinguishes an SLA from an SLO: SLAs have teeth. Violations typically trigger financial penalties, service credits, or contract termination clauses.
SLAs are negotiated with customers, not set internally by engineering teams. They reflect commercial and legal commitments, not purely technical ones. An SLA that commits to 99.9% monthly availability is a business decision as much as a technical one — your organization is accepting the financial consequences of missing that target.
Well-written SLAs define three things clearly: the metric being measured (the SLI), the threshold that constitutes a violation, and the measurement methodology. SLAs that define the threshold but not the measurement methodology are open to dispute — “99.9% availability” means nothing without agreement on how availability is measured, from where, over what window, and what constitutes an exclusion (planned maintenance, force majeure events, customer-caused failures).
SLA Gotchas Engineering Teams Discover Too Late
The measurement methodology is everything. If your SLA defines availability as measured from the provider’s infrastructure but your customer measures it from their location, you can both be right while disagreeing completely. Specify measurement point, measurement frequency, and data source in the contract — not just the threshold.
Exclusions need explicit enumeration. “Planned maintenance windows” is not a sufficient exclusion clause. How much planned maintenance? Announced how far in advance? During what time window? Vague exclusions become disputes.
Penalties need to scale appropriately. A service credit of 10% of monthly fees for a 99.9% SLA violation is a mild financial incentive to improve. For customers whose businesses depend on your service, that credit may not come close to their actual cost of downtime. Understand what your customers’ downtime costs them — it informs how seriously to take your own SLA commitments.
SLA vs SLO vs SLI: The Key Differences at a Glance
[Figure: SLI → SLO → SLA. Each layer depends on the one below being well-defined.]
The Relationship Between SLI, SLO, and SLA in Practice
The three work as a stack, not as parallel tracks. Here is how they connect in a real production environment.
Your monitoring system collects the SLI continuously: P99 latency for your API, measured every 60 seconds. That SLI is evaluated against your SLO: P99 below 400ms in 99.5% of those one-minute intervals over a rolling 28-day window. When the SLI shows that the P99 is trending toward 400ms, your SLO compliance is eroding and your error budget is burning. That’s the signal that triggers engineering action, not because an SLA violation is imminent, but because your internal target says this needs attention now.
The SLA sits above this. If your SLO is 99.5% compliance and your SLA commits to 99% compliance on the same measure, you have a buffer of 0.5 percentage points between your internal target and your external commitment. When you protect your SLO, you’re automatically protecting your SLA. When SLO violations start becoming frequent, you’re burning through that buffer, and SLA violations follow.
The practical implication: SLOs are your early warning system. SLAs are your last line of defense. Teams that only track SLA compliance are finding out about reliability problems after they’ve already affected customers and triggered contractual consequences. Teams that track SLO compliance are finding out while they still have time to act.
How to Define Good SLOs: A Step-by-Step Approach
[Figure: SLO definition is a five-step process. Skipping any step produces a target that can’t be operationalized.]
Step 1: Choose the right SLI. Start with what users actually experience. For a web API, that’s almost always availability and latency. For a data pipeline, it’s data freshness and completeness. For a messaging queue, it’s throughput and delivery time. Pick the SLI that, if it degraded, users would notice first.
Step 2: Measure your actual baseline. Run your chosen SLI measurement for 90 days without setting any targets. Understand what “normal” looks like for your system. What’s the typical P99 latency? What’s your actual availability rate? You cannot set a meaningful target without knowing your baseline.
Step 3: Set a target slightly above the baseline. If your baseline availability is 99.87%, an SLO of 99.9% is reasonable. An SLO of 99.99% is not — it sets your team up to fail constantly. An SLO that you breach every month is not an objective; it’s a recurring frustration. Start with a target that reflects demonstrated performance, then tighten it as you make reliability improvements.
Step 4: Define the measurement window and error budget. Choose rolling 28 days for operational SLOs. Calculate the error budget: for a 99.9% availability SLO over 28 days, that’s 40.3 minutes of permitted downtime. Document what exhausting the error budget means for your team — which typically means halting risky deployments until the budget recovers.
Step 5: Wire the SLO to your alerting and escalation system. An SLO that lives in a document but doesn’t trigger any operational response is useless. Configure alerts that fire when error budget burn rate is high — not just when the SLO is already breached. Burn rate alerting tells you that you’re on track to exhaust the budget, giving you time to act before the SLO window closes. Connect those alerts to your on-call schedule so the right engineer gets the notification.
SLO vs SLI: The Specific Distinction
This is the comparison that trips up engineers most often, so it’s worth being precise.
An SLI is a fact about what your system is doing right now. “Our API returned a 500 error for 0.08% of requests in the last hour” is an SLI measurement. It tells you the current state, nothing more.
An SLO is a judgment about whether that fact is acceptable. “Our error rate SLO requires that fewer than 0.1% of requests return errors over a 28-day window” is the SLO. Comparing the SLI measurement to the SLO threshold is what generates an operational signal.
The SLI has no opinion. The SLO is all opinion — it encodes your team’s judgment about what level of reliability is acceptable for this service. Two teams running identical systems could have very different SLOs depending on their user base, their error budget policies, and the commercial commitments they’ve made.
The practical implication: you can have an SLI without an SLO (you’re collecting data with no target), but you cannot have a meaningful SLO without a measurable SLI. The SLI comes first.
SLA vs SLO: Why Your SLO Should Always Be Stricter
This is the most important relationship in the entire framework. Your SLO should be set at a level that, if maintained, guarantees your SLA is never breached.
If your SLA commits to 99.9% monthly availability, your SLO should target 99.95% or higher. The gap between the two is your safety margin — the buffer that absorbs the difference between your internal operational reality and your external commitment.
[Figure: The gap between your SLO and SLA is your operational safety margin. Protect it deliberately.]
Why does this matter in practice? Because systems don’t fail cleanly. When an incident occurs, you typically don’t know immediately whether it’s going to last two minutes or two hours. If your SLO and SLA are set at the same threshold, any incident that burns your SLO compliance is simultaneously threatening your SLA. You have no buffer, no room for the incident to be investigated and resolved without immediately triggering contractual consequences.
The buffer between SLO and SLA is the space in which your engineering team operates. Protect it by treating SLO compliance as the operational priority, and let SLA compliance be the natural consequence.
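The size of that buffer can be expressed in minutes of downtime. The sketch below assumes both targets are availability percentages measured over the same 30-day window; the function names are illustrative.

```python
def allowed_downtime_minutes(target, window_days=30):
    """Downtime a given availability target permits over the window."""
    return (1 - target) * window_days * 24 * 60


def safety_margin_minutes(slo_target, sla_target, window_days=30):
    """Downtime between breaching the SLO and breaching the SLA."""
    if slo_target <= sla_target:
        raise ValueError("the SLO must be stricter than the SLA")
    return (allowed_downtime_minutes(sla_target, window_days)
            - allowed_downtime_minutes(slo_target, window_days))


# SLA of 99.9% protected by an SLO of 99.95%
print(round(safety_margin_minutes(0.9995, 0.999), 1))  # 21.6
```

With these targets, an incident that exhausts the SLO budget still leaves about 21.6 minutes before the SLA is breached. Setting the two targets equal leaves none, which is why the sketch rejects that case.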
Common SLI, SLO, and SLA Mistakes
Tracking too many SLIs. If you have 40 SLIs, you have noise, not signal. Pick the three to five metrics that most directly reflect the user experience for each critical service. Everything else is supporting context, not a primary SLI.
Setting SLOs without baselines. An SLO of 99.99% availability sounds good. If your system has never actually achieved 99.99% availability for a sustained period, you’ve set a target you’ll breach immediately. Measure first, set targets second.
SLOs that don’t trigger anything. An SLO stored in a spreadsheet that nobody checks is a compliance theater exercise. Every SLO needs to be wired to a burn rate alert, which is wired to an escalation policy, which is wired to your on-call rotation. The chain has to be complete or the SLO is decorative.
Confusing SLA compliance reporting with reliability measurement. Your SLA might only require monthly availability reporting. That doesn’t mean you should only be measuring availability monthly. SLIs are continuous operational data. SLA reporting is a periodic summary of that data for external stakeholders. The cadences are completely different.
SLAs that lack measurement methodology. “99.9% uptime” without specifying how uptime is measured, from where, and how maintenance windows are counted is a contract waiting to be disputed. Define the methodology in the SLA, not just the threshold.
Treating SLA compliance as the reliability goal. SLA compliance means you didn’t violate your contract. It does not mean your service is reliable. A service can deliver exactly 99.9% availability, meeting its SLA, while still failing one request in every thousand and causing real disruption. SLOs set higher than the SLA threshold are what drive continuous reliability improvement beyond mere contractual minimums.
SLI, SLO, SLA Reference: Availability Targets and Error Budgets
| Availability SLO | Downtime per year | Downtime per month | Error budget (30 days) | Realistic for |
|---|---|---|---|---|
| 99% (“two nines”) | 87.6 hours | 7.3 hours | 7.2 hours | Internal tools, dev environments |
| 99.5% | 43.8 hours | 3.6 hours | 3.6 hours | Non-critical production services |
| 99.9% (“three nines”) | 8.8 hours | 43.8 minutes | 43.2 minutes | Standard production SLAs |
| 99.95% | 4.4 hours | 21.9 minutes | 21.6 minutes | SLO buffer above a 99.9% SLA |
| 99.99% (“four nines”) | 52.6 minutes | 4.4 minutes | 4.3 minutes | High-availability production, payments |
| 99.999% (“five nines”) | 5.3 minutes | 26.3 seconds | 25.9 seconds | Telco, critical infrastructure |
Note: “Five nines” is rarely achievable for application-layer services without significant architectural investment. Most production web services operate between 99.9% and 99.99%.
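Figures like these can be regenerated for any target. The sketch below assumes a 365-day year, an average calendar month of one twelfth of a year (730 hours), and a 30-day window of 720 hours, so the 30-day figures come out slightly lower than the monthly ones. The helper names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24               # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730, an average calendar month
HOURS_PER_30_DAYS = 30 * 24             # 720


def downtime_hours(target_percent, window_hours):
    """Hours of downtime permitted by an availability target."""
    return (100 - target_percent) / 100 * window_hours


def human(hours):
    """Format a duration in hours, minutes, or seconds, whichever fits."""
    if hours >= 1:
        return f"{hours:.1f} hours"
    minutes = hours * 60
    if minutes >= 1:
        return f"{minutes:.1f} minutes"
    return f"{minutes * 60:.1f} seconds"


for target in (99, 99.5, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}%",
          human(downtime_hours(target, HOURS_PER_YEAR)),
          human(downtime_hours(target, HOURS_PER_MONTH)),
          human(downtime_hours(target, HOURS_PER_30_DAYS)),
          sep=" | ")
```

Whichever window your SLO uses, compute the budget from that window rather than borrowing a monthly figure, since the two differ by about 1.4%.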
How SLOs Connect to Incident Management
SLOs don’t operate in isolation. They are inputs to your incident management process — specifically, SLO burn rate is one of the most actionable signals for triggering and prioritizing incident response.
When an SLO starts burning (when error budget is being consumed faster than the rate that would exhaust it by the end of the window), that’s the signal to open an incident. The burn rate tells you how urgent the response needs to be. A burn rate of 1x (consuming budget at exactly the rate that would exhaust it by window end) warrants attention. A burn rate of 14.4x (consuming a full 30-day budget in about 50 hours, or 2% of it every hour) warrants an immediate P1 incident declaration and full incident response.
Google’s SRE workbook describes multi-window, multi-burn-rate alerting. Each alert pairs a long lookback window (for example 1 hour or 6 hours) with a much shorter one, and fires only when the burn rate exceeds the threshold over both, so the alert stops soon after the problem does. A high threshold on the 1-hour window catches fast-burning incidents; a lower threshold on the 6-hour window catches slow burns that accumulate over time. Both matter: a slow burn that runs for days can cost more error budget than a fast burn that recovers quickly.
Connecting SLO data to your alerting system and on-call rotation closes the loop between reliability measurement and operational response. SLOs measured but not acted on are incomplete. The full system is: measure SLI → evaluate against SLO → alert on burn rate → route to on-call → resolve → post-incident review → adjust SLO if needed.
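The burn rate step of that loop can be sketched in a few lines. This is a simplified illustration, not a drop-in alerting rule: the function names are hypothetical, the error ratios are invented, and a real setup would read them from your monitoring system for each lookback window.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the budget is being spent.

    error_ratio is the fraction of bad events seen in a lookback window.
    A burn rate of 1.0 spends exactly one full budget over the SLO window.
    """
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate holds steady."""
    return window_days * 24 / rate


def should_page(short_error_ratio, long_error_ratio, slo_target, threshold):
    """Multi-window check: fire only if both windows exceed the threshold."""
    return (burn_rate(short_error_ratio, slo_target) >= threshold
            and burn_rate(long_error_ratio, slo_target) >= threshold)


print(round(hours_to_exhaustion(14.4), 1))  # 50.0
print(round(hours_to_exhaustion(1.0), 1))   # 720.0

# Hypothetical 99.9% SLO with 2% of requests failing in both windows
print(should_page(0.02, 0.02, 0.999, threshold=14.4))    # True

# Long window still elevated, but the short window has recovered
print(should_page(0.0005, 0.02, 0.999, threshold=14.4))  # False
```

Requiring both windows to exceed the threshold is what stops an alert from firing long after the underlying problem has been fixed.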
Frequently Asked Questions
What is the difference between SLA and SLO?
An SLA (Service Level Agreement) is an external contract with customers that defines the service level committed and the consequences of missing it — typically financial penalties or service credits. An SLO (Service Level Objective) is an internal target set by your engineering or SRE team that defines what reliability level you’re aiming for. SLOs should always be stricter than the corresponding SLA — protecting your SLO automatically protects your SLA.
What is the difference between SLI and SLO?
An SLI (Service Level Indicator) is the raw measurement — the actual latency, error rate, or availability figure your system is producing right now. An SLO is the target applied to that measurement — the threshold that determines whether your SLI reading is acceptable. You can have an SLI without an SLO (data with no target), but you can’t have a meaningful SLO without a measurable SLI. The SLI comes first.
What is an error budget?
An error budget is the amount of unreliability your SLO permits over the measurement window. If your availability SLO is 99.9% over 30 days, your error budget is 0.1% of that window — about 43 minutes. Error budgets make reliability trade-offs concrete: spending the budget on a risky deployment is a deliberate choice, not an accident. Exhausting the budget is a signal to halt deployments and focus on reliability work.
How do I set an SLO?
Start by measuring your actual SLI for at least 90 days without setting any target. Understand your baseline performance. Then set an SLO slightly above that baseline — achievable with current system behavior, but tight enough to signal when things are degrading. Don’t set aspirational targets; set targets that reflect demonstrated performance and tighten them as reliability improves. The SLO must be wired to a burn rate alert and an on-call escalation to be operationally useful.
What is SLO burn rate?
Burn rate is the speed at which you’re consuming your error budget. A burn rate of 1x means you’re consuming budget at exactly the rate that would exhaust it by window end — acceptable but worth monitoring. A burn rate of 14.4x means you’re consuming 30 days of budget in 50 hours — a P1 incident. Burn rate alerting fires before your SLO is breached, giving you time to respond while you still have error budget remaining.
Do all services need SLAs?
Not all services need formal SLAs — internal tools, development environments, and non-customer-facing systems typically don’t require contractual commitments. But all production services that users depend on should have SLOs, regardless of whether there’s a formal SLA. SLOs are how you operationalize reliability. SLAs are how you make commercial commitments. Internal services need the former; only customer-facing services typically need the latter.
What’s the difference between SLO and KPI?
A KPI (Key Performance Indicator) is a broad business metric — customer satisfaction score, revenue, churn rate. An SLO is a specific, technically-defined reliability target for a service. KPIs are business health indicators; SLOs are operational reliability targets. They sit at different levels of the organization but can be connected: SLO compliance feeds into the reliability KPIs that business stakeholders track. See our guide on incident management KPIs for how these connect in practice.
Conclusion
SLIs, SLOs, and SLAs are not three ways of saying the same thing. They’re three distinct layers of a reliability framework, each with a different owner, a different purpose, and a different set of consequences when ignored.
SLIs tell you what’s happening. SLOs tell you whether what’s happening is acceptable. SLAs tell you what it costs when it’s not. Get the hierarchy right — ground your SLOs in measured SLIs, set your SLOs stricter than your SLAs, and wire your SLOs to your alerting and incident response system — and you have a framework that surfaces reliability problems before they become customer-visible failures.
The teams that maintain high reliability don’t monitor their SLAs. They monitor their SLOs, protect their error budgets, and treat SLA compliance as the natural outcome of that work. The SLA is the floor. The SLO is the standard you actually hold yourself to.