Most IT professionals are used to talking about uptime, downtime, and system failure. But not everyone is entirely clear on the definition of the terms widely used in the industry. What exactly differentiates “mean time to failure” from “mean time between failures”? And how does “mean time to repair” play into it? Let’s get some definitions straight!
Definition of a Failure
I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if the system is down, it has failed. But what about the system running in degraded mode, such as a RAID array that is rebuilding? And what about systems that are intentionally brought off-line?
Technically speaking, a failure is declared when the system does not meet its desired objectives. When it comes to IT systems, including disk storage, this generally means an outage or downtime. But I have experienced situations where the system was running so slowly that it should be considered failed even though it was technically still “up.” Therefore, I consider any system that cannot meet minimum performance or availability requirements to be “failed.”
Similarly, a return to normal operations signals the end of downtime or system failure. Perhaps the system is still in a degraded mode, with some nodes or data protection systems not yet online, but if it is available for normal use I would consider it to be “non-failed.”
Mean Time to Failure (MTTF)
The first metric that we should understand is the time that a system is not failed, or is available. Often referred to as “uptime” in the IT industry, the length of time that a system is online between outages or failures can be thought of as the “time to failure” for that system.
For example, if I bring my RAID array online on Monday at noon and the system functions normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this happens every week, with repairs lasting from Friday noon until Monday noon, I could average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would probably also call my system vendor and demand that they replace this horribly unreliable device!
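To make the arithmetic concrete, here is a minimal sketch of the averaging described above. The function name, the starting date, and the four-week log are all hypothetical; the only figures taken from the example are the Monday-noon start and the Friday-noon failure.

```python
from datetime import datetime, timedelta

def mean_time_to_failure(uptime_windows):
    """Average length, in hours, of a list of (online, failed) windows."""
    hours = [
        (failed - online).total_seconds() / 3600
        for online, failed in uptime_windows
    ]
    return sum(hours) / len(hours)

# Hypothetical log: four weeks, online Monday noon, disk failure Friday noon.
first_monday = datetime(2024, 1, 1, 12, 0)  # 2024-01-01 falls on a Monday
windows = [
    (first_monday + timedelta(weeks=w), first_monday + timedelta(weeks=w, days=4))
    for w in range(4)
]

mttf_hours = mean_time_to_failure(windows)
print(mttf_hours)  # 96.0
```

With real data the windows would differ in length, which is exactly why the metric is a mean rather than a single measurement.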
Most systems only occasionally fail, so it is important to think of reliability in statistical terms. Manufacturers often run controlled tests to see how reliable a device is expected to be, and sometimes report these results to buyers. This is a good indication of the reliability of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately, many vendors refer to this metric as “mean time between failure” (MTBF), which is incorrect as we shall soon see.
Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100 years. But no one should expect a given hard disk drive to last this long. In fact, disk replacement rate is much higher than disk failure rate!
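One way to read such a large number is as a statement about a population of drives rather than about any single drive. The sketch below converts the quoted figure to years and estimates failures per year in a fleet. It assumes a constant failure rate of 1/MTTF per powered-on hour, which is a simplification and not something a datasheet guarantees; the function names and the fleet size of 1,000 are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored

def mttf_in_years(mttf_hours):
    """Convert an MTTF quoted in hours to years of continuous operation."""
    return mttf_hours / HOURS_PER_YEAR

def expected_failures_per_year(mttf_hours, drive_count):
    """Expected failures per year in a fleet running around the clock.

    Assumption: each drive fails at a constant rate of 1/MTTF per hour.
    """
    return drive_count * HOURS_PER_YEAR / mttf_hours

years = mttf_in_years(1_000_000)
fleet_failures = expected_failures_per_year(1_000_000, 1000)
print(round(years, 1))   # 114.2
print(fleet_failures)    # 8.76
```

Under that assumption, a shop running 1,000 such drives should plan for roughly nine failures a year, even though no individual drive is expected to run for a century.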
Mean Time to Repair (MTTR)
Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals know that this is not the case. In fact, I might still be a systems administrator if it wasn’t for the fact that I had to spend hours in freezing cold datacenters trying to repair failed systems! The amount of time required to repair a system and bring it back online is the “time to repair”, another critical metric.
In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three days, or 72 hours, to get things operational again. Over time, we would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we would be justified in complaining to the vendor at this point.
Repairs can be excruciating, but they often do not take anywhere near as long as this. In fact, most computer systems and devices are wonderfully reliable, with MTTF measured in months or years. But when things do go wrong, it can often take quite a while to diagnose, replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours rather than days.
Mean Time Between Failures (MTBF)
The most common failure-related metric is also the one most often used incorrectly. “Mean time between failures” or “MTBF” refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required for a device to fail and that failure to be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often reflects the number of drives that fail rather than the rate at which they fail!
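The three numbers for the flaky array can be derived from a simple log of failure cycles. This is only a sketch using the definitions given in this article, where MTBF is taken as MTTF plus MTTR; the function name and the four-cycle log are hypothetical.

```python
def summarize(cycles):
    """Return (MTTF, MTTR, MTBF) in hours.

    cycles: list of (uptime_hours, repair_hours) pairs, one per failure.
    MTBF follows this article's definition: MTTF + MTTR.
    """
    mttf = sum(up for up, _ in cycles) / len(cycles)
    mttr = sum(repair for _, repair in cycles) / len(cycles)
    return mttf, mttr, mttf + mttr

# Hypothetical log: four weeks of 96 hours up, then 72 hours of repair.
cycles = [(96, 72)] * 4
mttf, mttr, mtbf = summarize(cycles)
print(mttf, mttr, mtbf)  # 96.0 72.0 168.0
```

Keeping uptime and repair time as separate columns means the two averages can still be reported individually, whichever definition of MTBF a vendor happens to use.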
Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF are just as important!
Ethan Wilson says
Thanks for the explanation! I am finishing my Security+
Edhighere Mena Princewill says
nice piece with most explanations here based on system and server admin but i could easily relate them to telecommunication field operation management…thanks
Ha! Me Too! Good luck! did you pass?!
Tom Roltsch says
You know IT but not reliability engineering. The sum of MTTF and MTTR is not MTBF. Mean time to failure (MTTF) is a metric used for non-repairable systems, like light bulbs, that have a useful life and then are discarded when they fail. Mean time between failure (MTBF) is used for repairable systems. It is the average operational time between failures. It is very important to note that this average is over the entire useful life of the system. Because MTBF is an average over the system lifetime, it is the most likely estimator for the rate in a homogeneous Poisson process. Thus, both MTTF and MTBF are reciprocals of the failure rate for the non-repairable device or the repairable system. This allows us to calculate reliability (the probability that the device or system will not fail) for any time interval. Repairable systems have a failure function and a restore function. MTTF and MTBF are techniques used to estimate the failure function. Maintainability (MTTR) is the probability that the system will be restored to service in a given amount of time. It is a technique used to estimate the restore function. MTBF and MTTR represent two distinct processes in a Markov chain: failure and restore. Their sum is meaningless (e.g. MTBSI is a meaningless metric). The meaningful information about the failure function and the repair function is evident only when MTBF and MTTR are expressed separately.
Is there a metric for “number of uses until failure”? I’m trying to figure out the metrics involved in software stability in general, and in some cases MTBF is a good metric where the activity is held over a period of time, but in other cases it would be useful to have a dedicated metric to repetitive actions until failures. Is there such a metric?
Tom j roltsch says
Yes. Generally one uses the metric that is most meaningful for the system. Autos often use mean miles between failure and guns often use number of rounds fired between failures. Switches and relays use number of cycles between failure.
Aaron Alexis says
Thanks for sharing such an informative post!! This is a great article, and something I think needs to be communicated more often. Unexpected failures are one of the main causes of high maintenance cost. So, one must use effective downtime tracking software in their industry. Professionals are recommending Thrive’s OEE Tracking and downtime tracking software, it helps in reducing downtime and increasing the ROI.
Ya, you're right, this article is false and should be taken down.