October 25, 2014

Defining Failure: What Are MTTR, MTTF, and MTBF?

Most IT professionals are used to talking about uptime, downtime, and system failure. But not everyone is entirely clear on the definitions of the terms widely used in the industry. What exactly differentiates “mean time to failure” from “mean time between failures”? And how does “mean time to repair” play into it? Let’s get some definitions straight!

Definition of a Failure

I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if the system is down, it has failed. But what about a system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally brought offline?

Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still “up.” Therefore, I consider any system that cannot meet minimum performance or availability requirements to be “failed.”

Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be “non-failed.”

MTBF is the sum of MTTR and MTTF

Mean Time to Failure (MTTF)

The first metric that we should understand is the time that a system is not failed, or is available. Often referred to as “uptime” in the IT industry, the length of time that a system is online between outages or failures can be thought of as the “time to failure” for that system.

For example, if I bring my RAID array online on Monday at noon and the system functions normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this happens every week, with repairs lasting from Friday noon until Monday noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would probably also call my system vendor and demand that they replace this horribly unreliable device!
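
If you prefer to see the arithmetic spelled out, here is a minimal Python sketch of that calculation. The interval values and the function name are purely illustrative, not output from any real monitoring tool.

# Hypothetical uptime intervals in hours: online Monday noon, failed Friday
# noon, so 96 hours of availability each week in the example above.
uptime_hours = [96, 96, 96, 96]

def mean_time_to_failure(uptimes):
    """Average the observed time-to-failure intervals."""
    return sum(uptimes) / len(uptimes)

print(mean_time_to_failure(uptime_hours))  # 96.0 hours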

Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as “mean time between failure” (MTBF), which is incorrect as we shall soon see.

Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, the rate at which disks are replaced in the field is much higher than the failure rate these MTTF figures imply!
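
If you want to double-check that conversion, it is just back-of-the-envelope arithmetic:

# Convert a quoted MTTF of 1,000,000 hours into years (24 hours x 365 days).
mttf_hours = 1_000_000
print(mttf_hours / (24 * 365))  # roughly 114 years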

Mean Time to Repair (MTTR)

Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it weren’t for the fact that I had to spend hours in freezing cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the “time to repair,” another critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three days, or 72 hours, to get things operational again. Over time, we would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we would be justified in complaining to the vendor at this point.
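
The repair side of the calculation looks exactly the same. Again, this is just a sketch using the made-up numbers from our example:

# Hypothetical repair windows in hours: Friday noon to Monday noon, 72 hours each.
repair_hours = [72, 72, 72, 72]

def mean_time_to_repair(repairs):
    """Average the observed repair durations."""
    return sum(repairs) / len(repairs)

print(mean_time_to_repair(repair_hours))  # 72.0 hours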

Repairs can be excruciating, but they often do not take anywhere near as long as this. In fact, most computer systems and devices are wonderfully reliable, with MTTF measured in months or years. But when things do go wrong, it can often take quite a while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours rather than days.

Mean Time Between Failures (MTBF)

The most common failure-related metric is also the one most often used incorrectly. “Mean time between failures” or “MTBF” refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required for a device to fail and for that failure to be repaired.

For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often reflects the number of drives that fail rather than the rate at which they fail!
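
In code, the relationship really is just addition. Using the illustrative figures from our flaky array:

mttf_hours = 96   # mean time to failure from the example above
mttr_hours = 72   # mean time to repair from the example above

mtbf_hours = mttf_hours + mttr_hours  # MTBF is the sum of MTTF and MTTR
print(mtbf_hours)  # 168 hours, exactly one week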

Stephen’s Stance

Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF are just as important!