I’m currently reading Designing Data-Intensive Applications. Quite early in the book an interesting application of basic probability theory pops up. The author says “Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day”. The point is that even with components which are quite reliable, you will see a lot of failures at scale. Large-scale distributed systems therefore need to acknowledge this fact and be resilient to failures. I think there’s some neat math behind the statement itself, irrespective of its sobering effect on engineers. So let’s see how we would go about deriving it.
The first thing to do is lay some groundwork for modeling the problem. Let’s say we have $N$ computers. A computer can be in either the working or the failed state. All computers start out working, but can fail later. Failure is temporary, however, and a machine which fails will be quickly fixed. We are interested in the behavior of these computers with regard to failures across time. We consider time as flowing in discrete steps, much like it would in a turn-based game. If a computer fails at a given time $t$, we can assume it will be fixed by time $t+1$ (where it can fail again, of course).
Failure is a random phenomenon. At the individual computer level we can say that a given machine on a given day will fail with a certain probability $p$. The whole ensemble of computers then has a probabilistic behavior as well. It is more complex, however, and we’ll study part of that behavior later.
We need to make a series of simplifying assumptions. First, we consider the probability of failure to be equal among all the computers. Second, this probability is independent of any external parameters, such as time, previous failures, or the order of the computer in the set. Therefore the single parameter $p$ can describe the whole ensemble’s behaviour. Note that this is not an entirely realistic expectation. Some failures are correlated. If a whole rack in a datacenter loses power, then all of the machines in that rack will fail. The fact that the machines share some property (being in the same rack) means that their failures are correlated, that is, not independent. Other machines just have hidden faults which cause abnormally high failure rates after a point. If you’ve seen a failure from such a machine, you’re likely to see another shortly. However, the model is decent enough to capture some good insights.
With this in mind, we can say that a failure for computer $i$ at time $t$ can be modelled as a random variable $X_i^t$ with a Bernoulli distribution, with its single parameter set to $p$. For all $i$ and $t$ we get the same model, and these random variables are independent and identically distributed.
At this point the question from the beginning can be stated as “what is the expected number of failures on a certain day $t$, given that the probability of one machine failing is $p$?”. To figure this out, we need to look at $F^t$, the number of failures on that day. Its distribution is the same for all values of $t$. We can write $F^t = \sum_{i=1}^{N} X_i^t$, since each $X_i^t$ produces either a $1$ or a $0$, depending on whether the machine is in the failed or in the working state. The resulting model for $F^t$ is the Binomial distribution with parameters $N$ and $p$. The expected or average number of machine failures on a given day is just $E[F^t] = Np$.
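To make the model concrete, here is a minimal simulation sketch of the independent-failures setup described above. The names `n_machines`, `p_fail` and `simulate_failures_per_day` are illustrative, not from the book, and the daily failure probability of 0.0001 is an assumed test value. Each machine is drawn as an independent Bernoulli trial, and the per-day sum should hover around the number of machines times the failure probability.

```python
import random


def simulate_failures_per_day(n_machines, p_fail, n_days, seed=0):
    """Return a list with the number of failed machines on each simulated day.

    Every machine fails independently with probability p_fail on every day,
    which is the (admittedly unrealistic) assumption made in the text.
    """
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(n_machines) if rng.random() < p_fail)
        for _ in range(n_days)
    ]


n_machines = 10_000   # cluster size from the quoted example
p_fail = 0.0001       # assumed per-machine, per-day failure probability

counts = simulate_failures_per_day(n_machines, p_fail, n_days=200)
empirical_mean = sum(counts) / len(counts)
expected = n_machines * p_fail  # mean of the Binomial distribution

print(f"expected failures per day: {expected:.2f}")
print(f"simulated mean over {len(counts)} days: {empirical_mean:.2f}")
print(f"fewest / most failures in a day: {min(counts)} / {max(counts)}")
```

The simulated mean will not match the expected value exactly, but it should land close to it, and the spread between the quietest and the busiest day gives a feel for the variance.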
So, for $N = 10{,}000$ machines and an arbitrarily chosen low $p = 0.0001$ we’d expect to see $Np = 1$ failure a day. We can use Wolfram Alpha to get some more insights. The graph is quite bad unfortunately, but there’s some good stuff as well. For example, the probability of at least one machine failure (which is usually referred to as a “success”) is $1 - (1 - p)^N \approx 0.632$. Wolfram Alpha also gives a sample of the number of failures across several days. On some days we see several failures, and on others none at all.
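The same quantities can be computed directly, without Wolfram Alpha. The sketch below assumes 10,000 machines and a per-day failure probability of 0.0001 (the values that give one expected failure a day); `binomial_pmf` is a helper written for this example, built on the standard library’s `math.comb`.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k failures) among n independent machines, each failing with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_machines = 10_000   # assumed cluster size
p_fail = 0.0001       # assumed per-machine, per-day failure probability

p_no_failure = binomial_pmf(0, n_machines, p_fail)
p_at_least_one = 1 - p_no_failure

print(f"P(no failure today)       = {p_no_failure:.4f}")
print(f"P(at least one failure)   = {p_at_least_one:.4f}")
for k in range(1, 5):
    print(f"P(exactly {k} failure(s))   = {binomial_pmf(k, n_machines, p_fail):.4f}")
```

The probability of at least one failure comes out at roughly 0.63, so even with one failure expected per day, more than a third of days pass with no failure at all.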
The last step is to obtain $p$ from the MTTF. We’ll focus on a single machine $i$, but look at it across time. The machine starts up in a functioning state and, given its relatively high reliability, chugs along for $T - 1$ days. However, on day $T$ it stops functioning for the first time. Since this is a random process, $T$ is itself a random variable. A good model for it is the geometric distribution with its parameter set to $p$, which is “the probability distribution of the number X of Bernoulli trials needed to get one success”. The mean of this distribution is $E[T] = 1/p$. But it is also our own MTTF, if we express it in days. So we can compute $p = 1/\mathrm{MTTF}$.
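As a sanity check on the geometric model, the sketch below draws many times to first failure and compares their average with the reciprocal of the daily failure probability, which is the mean of a geometric distribution. It uses inverse transform sampling rather than stepping through days one by one, purely to keep it fast. The value `p_fail = 0.0001` and the function name are assumptions for illustration.

```python
import math
import random


def sample_days_to_first_failure(p_fail, rng):
    """Draw the day of the first failure from a geometric distribution.

    Uses inverse transform sampling: for U uniform on (0, 1],
    floor(ln(U) / ln(1 - p)) + 1 is geometric on {1, 2, 3, ...}.
    """
    u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
    return math.floor(math.log(u) / math.log1p(-p_fail)) + 1


p_fail = 0.0001  # assumed per-day failure probability
rng = random.Random(42)

samples = [sample_days_to_first_failure(p_fail, rng) for _ in range(20_000)]
mean_days = sum(samples) / len(samples)

print(f"theoretical mean (1/p): {1 / p_fail:.0f} days")
print(f"simulated mean:         {mean_days:.0f} days")
print(f"shortest / longest run: {min(samples)} / {max(samples)} days")
```

The gap between the shortest and the longest run is large, which is typical of the geometric distribution: its standard deviation is almost as big as its mean.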
So, for our initial MTTF of 10 years, or 3650 days, we get $p = 1/3650 \approx 0.00027$, which is roughly thrice our original test value of $0.0001$. So with our model and data we’d actually expect about 3 failures a day. Wolfram Alpha has more insights into the geometric distribution’s behaviour with these parameters, including a sample of times to failure. So, again, we can see there’s some variance. Some failures occur early on, while others occur much later.
A special note is warranted here. If you pay attention to the sample of failure times, you’ll notice that some are quite big, which should raise some eyebrows. Indeed, because of our earlier assumptions about independence in time, we missed out on a lot of extra complexity in the failure behaviours of these computers. Failure modelling and reliability engineering are more complex topics, and more precise results can be obtained with them.
Finally, let’s put things together to get a nice formula. Starting with a given MTTF in years, provided by the manufacturer, we can compute the probability of failure on a single day, under all our model’s assumptions, as $p = \frac{1}{365 \cdot \mathrm{MTTF}}$. From this, we can compute the average number of failures in a day as $Np = \frac{N}{365 \cdot \mathrm{MTTF}}$. Which is quite neat and hopefully easy to remember.
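The final formula fits in a couple of small functions. This is only a sketch under the model’s assumptions (independent failures, a constant daily failure probability, 365 days in a year); the function and parameter names are made up for this example.

```python
def daily_failure_probability(mttf_years, days_per_year=365):
    """Per-day failure probability of one machine, given its MTTF in years."""
    return 1.0 / (days_per_year * mttf_years)


def expected_daily_failures(n_machines, mttf_years, days_per_year=365):
    """Average number of failures per day across n_machines identical machines."""
    return n_machines * daily_failure_probability(mttf_years, days_per_year)


n_machines = 10_000  # cluster size from the quoted example
for mttf_years in (10, 30, 50):
    p = daily_failure_probability(mttf_years)
    failures = expected_daily_failures(n_machines, mttf_years)
    print(f"MTTF {mttf_years:>2} years: p = {p:.6f}, expected failures per day = {failures:.2f}")
```

For 10,000 machines the result ranges from about 2.7 failures a day at an MTTF of 10 years down to about 0.55 at 50 years. Exactly one failure a day corresponds to an MTTF of roughly 27 years, which sits inside the quoted 10 to 50 year range.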