October 25, 2014

Defining Failure: What Are MTTR, MTTF, and MTBF?

Most IT professionals are used to talking about uptime, downtime, and system failure. But not everyone is entirely clear on the definitions of the terms widely used in the industry. What exactly differentiates “mean time to failure” from “mean time between failures”? And how does “mean time to repair” play into it? Let’s get some definitions straight!

Definition of a Failure

I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if the system is down, it has failed. But what about a system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally brought offline?

Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still “up.” Therefore, I consider any system that cannot meet minimum performance or availability requirements to be “failed.”

Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be “non-failed.”

MTBF is the sum of MTTR and MTTF

Mean Time to Failure (MTTF)

The first metric that we should understand is the time that a system is not failed, or is available. Often referred to as “uptime” in the IT industry, the length of time that a system is online between outages or failures can be thought of as the “time to failure” for that system.

For example, if I bring my RAID array online on Monday at noon and the system functions normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this happens every week, with repairs lasting from Friday noon until Monday noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would probably also call my system vendor and demand that they replace this horribly unreliable device!
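
If you prefer to see the arithmetic spelled out, here is a minimal Python sketch of that calculation. The interval values and the function name are purely illustrative, not output from any real monitoring tool.

# Hypothetical uptime intervals in hours: online Monday noon, failed Friday
# noon, so 96 hours of availability each week in the example above.
uptime_hours = [96, 96, 96, 96]

def mean_time_to_failure(uptimes):
    """Average the observed time-to-failure intervals."""
    return sum(uptimes) / len(uptimes)

print(mean_time_to_failure(uptime_hours))  # 96.0 hours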

Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as “mean time between failure” (MTBF), which is incorrect as we shall soon see.

Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, the rate at which disks are replaced in the field is much higher than the failure rate these MTTF figures imply!
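
If you want to double-check that conversion, it is just back-of-the-envelope arithmetic:

# Convert a quoted MTTF of 1,000,000 hours into years (24 hours x 365 days).
mttf_hours = 1_000_000
print(mttf_hours / (24 * 365))  # roughly 114 years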

Mean Time to Repair (MTTR)

Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it weren’t for the fact that I had to spend hours in freezing cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the “time to repair,” another critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three days, or 72 hours, to get things operational again. Over time, we would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we would be justified in complaining to the vendor at this point.
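
The repair side of the calculation looks exactly the same. Again, this is just a sketch using the made-up numbers from our example:

# Hypothetical repair windows in hours: Friday noon to Monday noon, 72 hours each.
repair_hours = [72, 72, 72, 72]

def mean_time_to_repair(repairs):
    """Average the observed repair durations."""
    return sum(repairs) / len(repairs)

print(mean_time_to_repair(repair_hours))  # 72.0 hours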

Repairs can be excruciating, but they often do not take anywhere near as long as this. In fact, most computer systems and devices are wonderfully reliable, with MTTF measured in months or years. But when things do go wrong, it can often take quite a while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours rather than days.

Mean Time Between Failures (MTBF)

The most common failure-related metric is also the one most often used incorrectly. “Mean time between failures” or “MTBF” refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required for a device to fail and for that failure to be repaired.

For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often reflects the number of drives that fail rather than the rate at which they fail!
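
In code, the relationship really is just addition. Using the illustrative figures from our flaky array:

mttf_hours = 96   # mean time to failure from the example above
mttr_hours = 72   # mean time to repair from the example above

mtbf_hours = mttf_hours + mttr_hours  # MTBF is the sum of MTTF and MTTR
print(mtbf_hours)  # 168 hours, exactly one week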

Stephen’s Stance

Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF are just as important!