Every engineering team eventually faces the same question during a chaotic outage: does this actually count as an incident? An incident is any unplanned event that disrupts a service, degrades its quality, or threatens its normal operation and requires a response to restore it. Getting this definition right is not academic. It determines when your team pages someone at 3 a.m., which severity level gets assigned, and whether the event feeds into your reliability metrics at all.
This guide is the foundation of our incident management library. It covers what qualifies as an incident, how incidents differ from events, problems, and outages, the incident lifecycle from detection to postmortem, and the roles and metrics that turn incident chaos into a repeatable process.
What Is an Incident?
In IT operations, an incident is an unplanned interruption to a service or a reduction in the quality of a service. That phrasing comes from ITIL 4, the most widely adopted service management framework, and it has two important implications:
- Full outages are not required. A checkout page that loads in 12 seconds instead of 2 is an incident, even though the service is technically “up.” Degradation counts.
- A response is implied. If nothing needs to be done and no one needs to be informed, it is an event, not an incident.
Security teams work from a different definition. NIST defines an incident as “an occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of an information system” or violates security policies. Note the word “potentially”: a credible threat can count even before any damage is done. The common thread across both worlds: something unplanned happened, it threatens normal operation, and someone has to act.
Site reliability engineering adds a pragmatic filter. In the Google SRE book, an event becomes a managed incident when it needs coordination: multiple responders, customer-visible impact, or an unclear resolution path. A single engineer restarting a stuck worker is operational noise; the same failure cascading across regions is an incident.
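The coordination filter above can be sketched as a small decision function. This is a minimal illustration, not a rule taken from the SRE book: the `Signal` fields and the `needs_managed_incident` name are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical summary of what is known about a failure so far."""
    responders_needed: int       # people or teams required to fix it
    customer_visible: bool       # can customers see the impact?
    resolution_path_known: bool  # does anyone know how to fix it?


def needs_managed_incident(signal: Signal) -> bool:
    """Return True if any of the three coordination triggers applies."""
    return (
        signal.responders_needed > 1
        or signal.customer_visible
        or not signal.resolution_path_known
    )


# The two cases from the paragraph above.
stuck_worker = Signal(responders_needed=1, customer_visible=False, resolution_path_known=True)
regional_cascade = Signal(responders_needed=4, customer_visible=True, resolution_path_known=False)

print(needs_managed_incident(stuck_worker))      # False
print(needs_managed_incident(regional_cascade))  # True
```

Any single trigger is enough here, which matches the spirit of declaring early rather than waiting for all three to line up.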
Incident vs Event vs Alert vs Problem vs Outage
These five terms get used interchangeably in war rooms, and the confusion has real costs: misrouted pages, skipped postmortems, and metrics that measure the wrong thing.
| Term | What it means | Example |
|---|---|---|
| Event | Any observable change in system state. Most are harmless. | A server logs a completed cron job |
| Alert | A notification that an event crossed a threshold and may need attention. | CPU above 90% for 5 minutes |
| Incident | An unplanned disruption or degradation that requires a response. | API error rate spikes, checkout failing for 20% of users |
| Problem | The underlying root cause of one or more incidents. | A memory leak that has caused three crashes this month |
| Outage | A complete loss of service availability; the most severe form of incident. | The entire platform is unreachable |
The incident-versus-problem distinction matters most for how work gets prioritized: incidents are about restoring service now; problems are about making sure it does not happen again. We break down that split, and when to run each process, in our guide to incident management vs problem management. And because not every alert deserves to become an incident, tuning what actually pages a human is its own discipline; see our guide to alert noise.
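To make the event-to-alert step concrete, here is a rough sketch of the "CPU above 90% for 5 minutes" rule from the table. The `Sample` type and `should_alert` function are hypothetical names for illustration; real monitoring tools express this kind of rule in their own configuration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    timestamp: float    # seconds
    cpu_percent: float  # one CPU reading (an event)


def should_alert(samples: Iterable[Sample],
                 threshold: float = 90.0,
                 window_s: float = 300.0) -> bool:
    """True once CPU has stayed above threshold for at least window_s seconds."""
    breach_start: Optional[float] = None
    for s in sorted(samples, key=lambda s: s.timestamp):
        if s.cpu_percent > threshold:
            if breach_start is None:
                breach_start = s.timestamp  # start of a continuous breach
            if s.timestamp - breach_start >= window_s:
                return True
        else:
            breach_start = None  # any dip below threshold resets the clock
    return False


# One reading per minute for five minutes.
sustained = [Sample(t, 95.0) for t in range(0, 301, 60)]
blip = [Sample(s.timestamp, 50.0 if s.timestamp == 120 else 95.0) for s in sustained]

print(should_alert(sustained))  # True
print(should_alert(blip))       # False
```

Every reading is an event, but only the sustained breach becomes an alert. Whether that alert then becomes an incident is a separate decision.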
Types of Incidents
Most teams encounter four broad categories:
- Availability incidents. The service is down or unreachable: crashed processes, failed deployments, expired certificates, DNS failures. These drive your downtime numbers.
- Performance incidents. The service works but slowly: latency spikes, timeouts, degraded throughput. Often harder to detect than outages because nothing is obviously broken.
- Security incidents. Unauthorized access, data exposure, malware, denial-of-service attacks. These follow the NIST definition and usually trigger a separate response track with legal and compliance involvement.
- Data incidents. Corruption, loss, or incorrect processing of data: a migration that silently dropped rows, a pipeline writing bad values downstream.
A major incident cuts across these categories. It is any incident with severe business impact, wide customer visibility, or SLA implications that demands an immediate, coordinated response. Most teams map it to their SEV1 tier and assign a dedicated incident commander.
How Are Incidents Classified? Severity and Priority
Once something is declared an incident, two questions follow: how bad is it, and how fast must we act?
Severity measures technical and business impact. Most engineering teams use a scale of SEV1 to SEV4 or SEV5, where SEV1 means a critical customer-facing failure and SEV4 covers minor issues with workarounds. Priority measures urgency of response, factoring in deadlines, affected customers, and business context. A SEV3 bug in an invoicing system might become top priority on the last day of the quarter.
Clear classification is the single highest-leverage piece of incident process, because everything else keys off it: who gets paged, how fast, and through which escalation policy. Our incident severity levels guide includes a ready-to-use classification matrix.
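As a rough illustration of how severity and priority can diverge, the sketch below classifies severity from impact and then lets business context raise priority. The thresholds (50% and 10% of users) and the function names are assumptions invented for this example, not a recommended matrix.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical customer-facing failure
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # minor issue with a workaround


def classify_severity(percent_users_affected: float, has_workaround: bool) -> Severity:
    """Illustrative thresholds only; every team sets its own."""
    if percent_users_affected >= 50:
        return Severity.SEV1
    if percent_users_affected >= 10:
        return Severity.SEV2
    if not has_workaround:
        return Severity.SEV3
    return Severity.SEV4


def priority(severity: Severity, business_critical_now: bool) -> int:
    """Priority 1 is most urgent. Business context can raise urgency
    without changing the severity itself."""
    return 1 if business_critical_now else int(severity)


# The invoicing bug from the text: small impact, but it is quarter end.
invoicing_bug = classify_severity(percent_users_affected=2, has_workaround=False)
print(invoicing_bug.name)                                   # SEV3
print(priority(invoicing_bug, business_critical_now=True))  # 1
```

Keeping the two functions separate mirrors the distinction in the text: severity describes impact, and priority describes how soon someone must act.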
The Incident Lifecycle: From Detection to Learning

Every incident, from a five-minute blip to a multi-hour outage, moves through the same five stages. Detection, response, and resolution each map to a metric your team should track.
1. Detection
Monitoring, users, or automated checks surface the anomaly. The time this takes is your MTTD (mean time to detect), and it is often the stage where resolution time is silently lost, because nobody is working on a failure that has not been noticed yet.
2. Response and triage
The right responder is identified and acknowledges the incident, measured as MTTA (mean time to acknowledge). This is where alert routing and on-call design earn their keep: an alert that reaches the wrong person adds minutes or hours before real work begins.
3. Diagnosis and mitigation
Responders identify the failing component and either fix it or apply a mitigation such as a rollback, failover, or feature flag. A well-maintained runbook is the difference between a 10-minute mitigation and an hour of improvisation. Throughout this stage, structured incident communication keeps stakeholders informed without pulling responders off the problem.
4. Resolution
Service is fully restored and verified. End-to-end recovery time is your MTTR (mean time to resolve), the metric that best summarizes how well the whole chain works under pressure.
5. Learning
The team runs a blameless postmortem to understand contributing causes and assign follow-up actions. Skipping this stage is how the same incident happens three more times. A documented incident response plan ties all five stages together so nobody is inventing process mid-outage.
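The five stages can be modeled as a simple ordered state machine. This is only a sketch of the idea: the `Stage` and `Incident` classes are hypothetical, and real incidents sometimes loop back (for example, when a mitigation fails), which this version deliberately ignores.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    RESPONSE = 2
    MITIGATION = 3
    RESOLUTION = 4
    LEARNING = 5


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.DETECTION
        self.history = [Stage.DETECTION]  # stages visited, in order

    def advance(self) -> Stage:
        """Move to the next stage; stages cannot be skipped."""
        if self.stage is Stage.LEARNING:
            raise ValueError("incident has already reached the learning stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage


incident = Incident("checkout latency spike")
while incident.stage is not Stage.LEARNING:
    incident.advance()

print([s.name for s in incident.history])
```

Because `advance` only ever moves one step forward, an incident in this model cannot be closed without passing through the learning stage.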
Who Responds to an Incident?
Small incidents may need a single on-call engineer. Larger ones need defined roles:
- Incident commander owns coordination and decisions, not the keyboard. See our guide to the incident commander role.
- Responders / subject-matter experts do the technical investigation and remediation.
- Communications lead handles status updates to stakeholders and customers.
- Scribe keeps a timeline that later feeds the postmortem.
Behind all of these sits the on-call rotation that determines who is reachable at any hour. Sustainable rotations, fair schedules, and fast mobile alerting are covered in our guide to on-call management.
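An escalation policy on top of that rotation can be sketched as an ordered list of steps, each with a delay. The role names, delays, and `targets_to_page` function below are hypothetical values chosen for illustration, not defaults from any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationStep:
    target: str         # hypothetical on-call role
    after_minutes: int  # page once the alert is unacknowledged this long


POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 5),
    EscalationStep("engineering-manager", 15),
]


def targets_to_page(policy: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Everyone who should have been paged by now for an unacknowledged alert."""
    return [step.target for step in policy if minutes_unacknowledged >= step.after_minutes]


print(targets_to_page(POLICY, 0))  # ['primary-on-call']
print(targets_to_page(POLICY, 7))  # ['primary-on-call', 'secondary-on-call']
```

Once someone acknowledges, escalation stops. That stop condition would sit outside this function in a real system.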
How Are Incidents Measured?
You cannot improve what you do not measure. The core incident metrics form a family, each covering one slice of the lifecycle:
| Metric | Measures | Lifecycle stage |
|---|---|---|
| MTTD | Time from failure start to detection | Detection |
| MTTA | Time from alert to acknowledgment | Response |
| MTTR | Time from failure start to full resolution | Resolution |
| MTBF | Average time between failures | Prevention |
Definitions, formulas, and benchmarks for the whole family live in our SRE metrics glossary, and the business-level view is covered in incident management KPIs.
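To show how the table translates into arithmetic, here is a minimal sketch that derives the four metrics from incident timestamps. The `IncidentRecord` fields are assumptions for this example. It treats detection time as the moment the alert fired, and MTBF as the average gap between one incident's resolution and the next one's start; some teams measure these differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class IncidentRecord:
    started: datetime       # failure began
    detected: datetime      # alert fired or anomaly surfaced
    acknowledged: datetime  # responder acknowledged
    resolved: datetime      # service restored and verified


def _mean_minutes(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: List[IncidentRecord]) -> Dict[str, float]:
    """Mean durations in minutes. Needs at least one incident."""
    return {
        "MTTD": _mean_minutes(i.detected - i.started for i in incidents),
        "MTTA": _mean_minutes(i.acknowledged - i.detected for i in incidents),
        "MTTR": _mean_minutes(i.resolved - i.started for i in incidents),
    }


def mtbf_minutes(incidents: List[IncidentRecord]) -> float:
    """Mean gap between a resolution and the next failure. Needs at least two incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean_minutes(gaps)


t0 = datetime(2024, 1, 1, 3, 0)
m = lambda n: timedelta(minutes=n)
day = timedelta(days=1)
history = [
    IncidentRecord(t0, t0 + m(4), t0 + m(6), t0 + m(30)),
    IncidentRecord(t0 + day, t0 + day + m(2), t0 + day + m(6), t0 + day + m(50)),
]

print(summarize(history))
print(mtbf_minutes(history))
```

For the two sample incidents, MTTD and MTTA both average 3 minutes and MTTR averages 40 minutes. Means hide outliers, so many teams also look at medians or percentiles.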
From Definition to Practice
Knowing what an incident is only pays off when the definition is wired into your process: agreed severity levels, an escalation policy that reaches the right person in seconds, runbooks for known failure modes, and a postmortem habit that turns each incident into prevention. Our incident management best practices guide covers how mature teams assemble those pieces.
Tooling matters too, because during an incident every lost minute compounds. An on-call incident management and orchestration platform like itoc360 connects your monitoring stack to on-call schedules, routes alerts through escalation policies, and delivers them on a mobile app fast enough for a 3 a.m. response, so the gap between “something broke” and “the right human is working on it” stays as small as possible.
Frequently Asked Questions
What is an incident in simple terms?
An incident is anything unplanned that breaks or degrades a service and needs someone to respond. If a system stops working correctly and action is required to restore it, that is an incident.
What is the difference between an incident and a problem?
An incident is the disruption itself; a problem is its underlying root cause. Incident management restores service quickly, while problem management investigates and eliminates the cause so the incident does not recur.
Is every outage an incident?
Yes, every outage is an incident, but not every incident is an outage. An outage is a complete loss of availability, while incidents also include partial failures and performance degradation where the service still technically works.
What qualifies as a major incident?
A major incident is one with severe business impact: widespread customer-facing failure, significant revenue or SLA exposure, or security implications. Most teams map it to their SEV1 severity tier and respond with a dedicated incident commander and formal communication process.
What is an incident in cyber security?
In security contexts, NIST defines an incident as an occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of a system, or violates security policies. Security incidents usually follow a dedicated response process alongside legal and compliance teams.
How do you know when to declare an incident?
Declare an incident when there is customer-visible impact, when you need more than one person to investigate, or when you are unsure how bad it is. Declaring early is cheap; declaring late costs resolution time. Clear severity definitions remove the guesswork.