Mean time to recovery (MTTR)
Mean time to recovery is the average time it takes to restore service after a failure. What counts as “recovery” must be defined as part of your incident process.
These pages explain common engineering and delivery metrics in plain language. Definitions vary by company, toolchain, and industry; we highlight typical usage and caveats. Nothing here is legal, financial, or professional advice, and it is not a substitute for judgment in your own context.
Metrics can be misused for surveillance or stack ranking. We do not recommend using them that way. DORA performance bands come from research and are contextual; they are not targets for individuals or a basis for hiring decisions.
Definition
Mean time to recovery (MTTR) in the DORA sense is the average time to restore service when a production incident or defect affects users. It is often measured from detection or declaration of impact to full restoration of normal operation.
Lower MTTR usually reflects good runbooks, observability, rollback paths, and on-call practices. It complements change failure rate: rare failures still hurt if recovery takes days.
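As a rough formalization of the definition above, MTTR over n incidents can be written as a plain average. The symbols are illustrative assumptions, not taken from DORA: t_i^start is whichever start point your process uses (detection, declaration, or impact start) and t_i^restored is the moment service is considered restored.

```latex
\mathrm{MTTR} = \frac{1}{n} \sum_{i=1}^{n} \left( t_i^{\text{restored}} - t_i^{\text{start}} \right)
```

For example, two incidents restored in 30 and 90 minutes give an MTTR of 60 minutes.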
How teams typically measure it
- Incident management tools: use the timestamps from impact start (or incident opened) to resolved, applied to the severities your severity matrix says should count.
- Some teams measure time to mitigate separately from time to fully remediate root cause; pick one definition and stick to it.
- Exclude scheduled maintenance unless your framework explicitly includes it.
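The bullets above can be sketched as a small calculation. This is a minimal illustration, not a definitive implementation: the `Incident` record, its field names, and the sample timestamps are hypothetical, and a real incident tool will export different fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical record shape; real tools export different field names.
    started: datetime    # impact start (or incident opened)
    resolved: datetime   # service restored, per your chosen definition
    scheduled_maintenance: bool = False


def mean_time_to_recovery(incidents: List[Incident]) -> Optional[timedelta]:
    """Average restore time, skipping scheduled maintenance."""
    durations = [
        i.resolved - i.started
        for i in incidents
        if not i.scheduled_maintenance
    ]
    if not durations:
        return None  # no qualifying incidents, so no meaningful average
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: two unplanned incidents and one maintenance window.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
    Incident(datetime(2024, 1, 12, 2, 0), datetime(2024, 1, 12, 4, 0),
             scheduled_maintenance=True),
]
mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

The maintenance window is excluded, so the result averages only the 30-minute and 90-minute incidents. If you track time to mitigate separately from time to remediate, the same function works with a different `resolved` timestamp, as long as the choice is applied consistently.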
Common pitfalls
- Mixing infrastructure blips with application regressions without segmentation; different root causes need different responses.
- Averaging MTTR across widely different incident severities can hide chronic small outages or rare catastrophic ones.
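One way to see the averaging pitfall is to compare a single overall mean with per-segment figures. The sketch below uses made-up numbers and hypothetical severity labels (`sev1`, `sev3`). The same grouping could be keyed on a cause category such as infrastructure versus application.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sample: (severity label, minutes to restore service).
recoveries = [
    ("sev3", 10), ("sev3", 12), ("sev3", 8), ("sev3", 14),
    ("sev1", 600),
]


def summarize_by_severity(rows):
    """Return {severity: (count, mean_minutes, median_minutes)}."""
    groups = defaultdict(list)
    for severity, minutes in rows:
        groups[severity].append(minutes)
    return {
        sev: (len(vals), mean(vals), median(vals))
        for sev, vals in groups.items()
    }


overall_mean = mean(m for _, m in recoveries)
by_severity = summarize_by_severity(recoveries)

print(overall_mean)  # 128.8
for sev, (count, avg, med) in sorted(by_severity.items()):
    print(sev, count, avg, med)
```

With this sample the overall mean is 128.8 minutes, which describes neither the four minor incidents (about 11 minutes each) nor the single 600-minute outage. Reporting per-severity counts alongside a mean or median keeps both patterns visible.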