Most IT professionals are used to talking about uptime, downtime, and system failure. But not everyone is entirely clear on the definition of the terms widely used in the industry. What exactly differentiates “mean time to failure” from “mean time between failures”? And how does “mean time to repair” play into it? Let’s get some definitions straight!
Definition of a Failure
I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if the system is down, it has failed. But what about a system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally brought offline?
Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still “up.” Therefore, I consider any system that cannot meet minimum performance or availability requirements to be “failed.”
Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be “non-failed.”
Mean Time to Failure (MTTF)
The first metric that we should understand is the time that a system is not failed, or is available. Often referred to as “uptime” in the IT industry, the length of time that a system is online between outages or failures can be thought of as the “time to failure” for that system.
For example, if I bring my RAID array online on Monday at noon and the system functions normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this happens every week, with repairs lasting from Friday noon until Monday noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would probably also call my system vendor and demand that they replace this horribly unreliable device!
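The averaging described above is simple enough to sketch in a few lines of code. This is a minimal illustration, assuming a hypothetical log of uptime durations matching the weekly 96-hour pattern in the example:

```python
from statistics import mean

# Hypothetical uptime durations in hours: the span from bringing the
# array online (Monday noon) to each disk failure (Friday noon),
# repeated weekly as in the example above.
uptimes_hours = [96, 96, 96, 96]

# MTTF is simply the average of the observed times to failure.
mttf = mean(uptimes_hours)
print(f"MTTF: {mttf} hours")  # MTTF: 96 hours
```

In practice the individual uptimes would vary, of course; the mean smooths those variations into a single reliability figure.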
Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as “mean time between failure” (MTBF), which is incorrect as we shall soon see.
Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, disk replacement rate is much higher than disk failure rate!
Mean Time to Repair (MTTR)
Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it weren’t for the fact that I had to spend hours in freezing cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the “time to repair”, another critical metric.
In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three days, or 72 hours, to get things operational again. Over time, we would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we would be justified in complaining to the vendor at this point.
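A single repair window is just the elapsed time between the failure and the return to service. A minimal sketch, using hypothetical timestamps for the Friday-to-Monday outage in the running example:

```python
from datetime import datetime

# Hypothetical outage window: the array fails Friday at noon and is
# back in service by Monday at noon (dates chosen for illustration).
failed_at = datetime(2024, 1, 5, 12, 0)    # Friday, noon
repaired_at = datetime(2024, 1, 8, 12, 0)  # Monday, noon

# Time to repair is the elapsed duration, converted to hours.
time_to_repair = repaired_at - failed_at
print(time_to_repair.total_seconds() / 3600)  # 72.0
```

Averaging such durations across many incidents yields the MTTR, just as averaging uptimes yields the MTTF.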
Repairs can be excruciating, but they often do not take anywhere near as long as this. In fact, most computer systems and devices are wonderfully reliable, with MTTF measured in months or years. But when things do go wrong, it can often take quite a while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours rather than days.
Mean Time Between Failures (MTBF)
The most common failure-related metric is also the one most often used incorrectly. “Mean time between failures” or “MTBF” refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required for a device to fail and for that failure to be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often reflects the number of drives that fail rather than the rate at which they fail!
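The relationship between the three metrics reduces to one line of arithmetic. A sketch, reusing the figures from the running example:

```python
# Figures from the running example (hours).
mttf_hours = 96  # mean uptime from repair to the next failure
mttr_hours = 72  # mean time spent repairing after a failure

# MTBF is the full failure-to-failure cycle: time up plus time down.
mtbf_hours = mttf_hours + mttr_hours
print(f"MTBF: {mtbf_hours} hours")  # MTBF: 168 hours -- one full week
```

Keeping the three quantities separate like this makes it obvious why a vendor quoting only “MTBF” is leaving out half the picture.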
Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF are just as important!