Defining Failure: What Is MTTR, MTTF, and MTBF?

July 6, 2011 By Stephen

Most IT professionals are used to talking about uptime, downtime, and system failure. But not everyone is entirely clear on the definitions of the terms widely used in the industry. What exactly differentiates “mean time to failure” from “mean time between failures”? And how does “mean time to repair” play into it? Let’s get some definitions straight!

Definition of a Failure

I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if the system is down, it has failed. But what about a system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally taken offline?

Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still “up.” Therefore, I consider any system that cannot meet minimum performance or availability requirements to be “failed.”

Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be “non-failed.”

MTBF is the sum of MTTR and MTTF

Mean Time to Failure (MTTF)

The first metric that we should understand is the time that a system is not failed, or is available. Often referred to as “uptime” in the IT industry, the length of time that a system is online between outages or failures can be thought of as the “time to failure” for that system.

For example, if I bring my RAID array online on Monday at noon and the system functions normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this happens every week, with repairs lasting from Friday noon until Monday noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would probably also call my system vendor and demand that they replace this horribly unreliable device!
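
To make the arithmetic concrete, here is a minimal Python sketch that averages observed uptime intervals into an MTTF figure. The numbers are the hypothetical weekly failures from the example above, not real measurements:

```python
# Minimal sketch: averaging observed uptime intervals (hours) into an MTTF figure.
# The values are hypothetical, matching the Monday-noon-to-Friday-noon example above.
uptime_hours = [96, 96, 96, 96]  # four weeks of Monday-to-Friday uptime

mttf = sum(uptime_hours) / len(uptime_hours)
print(f"MTTF: {mttf:.0f} hours")  # MTTF: 96 hours
```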

Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as “mean time between failures” (MTBF), which is incorrect, as we shall soon see.

Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, disk replacement rate is much higher than disk failure rate!
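
To put a figure like 1,000,000 hours in perspective, it helps to convert MTTF into an approximate annualized failure rate (AFR). The sketch below uses the common approximation AFR ≈ hours-per-year ÷ MTTF, which is reasonable whenever the MTTF is much longer than a year:

```python
# Rough sketch: converting a vendor MTTF figure into an approximate annualized
# failure rate (AFR). The approximation AFR ≈ hours_per_year / MTTF assumes the
# MTTF is much larger than one year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annualized_failure_rate(mttf_hours: float) -> float:
    return HOURS_PER_YEAR / mttf_hours

print(f"AFR at a 1,000,000-hour MTTF: {annualized_failure_rate(1_000_000):.2%}")
# AFR at a 1,000,000-hour MTTF: 0.88%
```

Read this way, a million-hour MTTF describes the expected behavior of a large fleet of drives, not the lifetime of any single unit.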

Mean Time to Repair (MTTR)

Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it weren’t for the fact that I had to spend hours in freezing-cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the “time to repair,” another critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three days, or 72 hours, to get things operational again. Over time, we would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we would be justified in complaining to the vendor at this point.
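
The same bookkeeping applies to repairs: record when a system fails and when it comes back online, take the difference, and average those durations to get MTTR. Here is a small sketch, using illustrative timestamps that match the Friday-noon-to-Monday-noon outage above:

```python
# Minimal sketch: deriving a single time-to-repair from failure and restore timestamps.
# The timestamps are illustrative, matching the Friday-to-Monday outage above.
from datetime import datetime

failed_at = datetime(2011, 7, 1, 12, 0)    # Friday, noon
restored_at = datetime(2011, 7, 4, 12, 0)  # Monday, noon

ttr_hours = (restored_at - failed_at).total_seconds() / 3600
print(f"Time to repair: {ttr_hours:.0f} hours")  # Time to repair: 72 hours
```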

Repairs can be excruciating, but they often do not take anywhere near as long as this. In fact, most computer systems and devices are wonderfully reliable, with MTTF measured in months or years. But when things do go wrong, it can often take quite a while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours rather than days.

Mean Time Between Failures (MTBF)

The most common failure-related metric is also the one most often used incorrectly. “Mean time between failures” or “MTBF” refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR: the total time it takes for a device to fail and for that failure to be repaired.

For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives fail only once in their life, and most never fail at all. So manufacturers don’t bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often reflects the number of drives that fail rather than the rate at which they fail!
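
Putting the pieces together, MTBF is simply MTTF plus MTTR. Here is a minimal sketch using the numbers from the flaky array above; note that the availability calculation (MTTF divided by MTBF) is a standard reliability formula added for illustration, not something spelled out in this post:

```python
# Minimal sketch: MTBF as the sum of MTTF and MTTR, using the flaky RAID array
# numbers above. Availability = MTTF / MTBF is a standard reliability formula,
# included here as an illustrative extra.
mttf_hours = 96   # mean time to failure
mttr_hours = 72   # mean time to repair

mtbf_hours = mttf_hours + mttr_hours
availability = mttf_hours / mtbf_hours

print(f"MTBF: {mtbf_hours} hours")          # MTBF: 168 hours
print(f"Availability: {availability:.1%}")  # Availability: 57.1%
```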

Stephen’s Stance

Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF are just as important!

Filed Under: Computer History, Enterprise storage, Features, Personal Tagged With: failure, metrics, MTBF, MTTF, MTTR, server management, storage management
