CodeKuduCodeKudu

Mean time to recovery (MTTR)

Mean time to recovery averages how long it takes to restore service after a failure; scope (what counts as “recovery”) must be defined with your incident process.

Definition

Mean time to recovery (MTTR) in the DORA sense is the average time to restore service when a production incident or defect affects users—often from detection or declaration of impact to full restoration of normal operation.

Lower MTTR usually reflects good runbooks, observability, rollback paths, and on-call practices. It complements change failure rate: rare failures still hurt if recovery takes days.

How teams typically measure it

  • Incident management tools: timestamps from impact start (or incident opened) to resolved per your severity matrix.
  • Some teams measure time to mitigate separately from time to fully remediate root cause; pick one definition and stick to it.
  • Exclude scheduled maintenance unless your framework explicitly includes it.

Common pitfalls

  • Mixing infrastructure blips with application regressions without segmentation—different root causes need different responses.
  • Averaging MTTR across widely different incident severities can hide chronic small outages or rare catastrophic ones.

Related terms

Browse other entries in the glossary.