A deeper dive into disk drive survival time


Evaluating newer classes in the context of historical failure data: Time windowed KM survival curves


Background: 

A substantial proportion of online data and services rely on hard disk drives, which form a ubiquitous part of modern information infrastructure. Reliable statistical analysis of differences in failure over time between disk drive models is therefore of particular interest to anyone responsible for maintaining storage integrity at home or at work. The Backblaze hard disk failure data represent an interesting "big data" analytic opportunity to compare enterprise and consumer hard disk drives over time under real world operating conditions. In this article, some statistical issues are discussed and the results of some simple analyses are presented. The results provide insight that cannot be obtained from simple descriptive statistics, and the statistical tests show that many of the observed differences are important and unlikely to have arisen purely by chance.

Survival analysis methods and their relevance to disk drive failure times:

Life table methods, often termed "survival" analysis, were initially developed for use in estimating life insurance risk by actuaries, and were refined to study time to an outcome event such as death or cancer recurrence in clinical data. When applied to the Backblaze disk failure time data, life table methods can provide statistical evidence and insight into differences in active service life-spans for different disk drive models and for different manufacturers, allowing them to be ranked in terms of the risk of failure during their working lives so the best ones can be selected. However, there are some important differences in the way the Backblaze data arise compared to the typical context in which life table methods are used.

Clinical data is analysed retrospectively, at the end of a clinical study, during which the treatment being studied is usually allocated randomly to patients as they enter the study population. As a result, membership of the "classes" of observed object to be compared, such as placebo treated patients and actively treated patients, is randomly distributed over the entire observation period. In the Backblaze data, disk drive models correspond to patient treatment status in clinical data, but new drive models form new classes as they appear on the market irregularly over time.

Time under observation provides information about event rates over time, so when the first of a new model hard disk drive is deployed into active service, there is zero time under observation for members of this new class available for study, although many other models may have years of observations. In the context of survival analysis by disk drive model, limited drive-days of observation for the newest products provides relatively little information about failure rates. Survival information about newer drives will always be confined to the upper left part of the survival curve over time compared with older drive models for which large amounts of historical data are displayed.

This peculiarity of the data is a challenge for graphical methods such as Kaplan Meier survival curves. Plotted over the whole survival data time period, the curves can be misleading, because the short observation times for newer drives are obscured in the dense upper left corner of the plot, where the curves for every class originate at time zero. Hard disk drive consumers and enterprise buyers want to know as soon as possible whether a new model disk drive can be expected to perform well in terms of failure risk. More generally, the same challenge arises whenever survival methods are applied to continuously accumulating data on rapidly evolving products such as LED light globes, where manufacturers update their models routinely so new product classes are added to the data stream irregularly. Note that this problem applies only to the KM survival curve plots. The non-parametric KM statistics remain valid and comparable since they take the total time under observation for each class into account in their estimates of statistical uncertainty.
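For readers who have not met the Kaplan Meier estimator, here is a minimal sketch of the product-limit calculation behind the curves. It is an illustration only, not the code used for this post; the function name and the plain-list input format are assumptions made for the example.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct failure time.

    durations: days under observation for each drive
    events:    1 if the drive failed at that time, 0 if it was censored
    """
    failures = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)      # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # product-limit step: multiply by the fraction surviving this day
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Five hypothetical drives: three failures, two censored observations
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

A class with only a few weeks of data produces a curve that stops after a few steps, which is why it disappears into the top left corner of a plot spanning several years.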

Ideally, a graphical method should introduce no distortions due to the larger amount of information available from other classes with overwhelmingly large amounts of historical data. One approach to making the performance of the newer drives more comparable would be to run a survival analysis on a time span limited to the weeks or months of observations available for the new drive. This right censored analysis compares all drives for as many weeks or months as we have available for the newer model. In practice, it is easy to take all the historical data and partition it into a number of time-limited windows, all starting at time zero and increasing in width. Each window is a subset of the failure data, limited to some smaller period than the whole data set.

For example, below we look at subsets containing only the first week. Similarly it is easy to look only at the first month or at the entire period of observation. A separate KM model is run for each of those subsets. They are not statistically independent but allow more recent models to be compared on an equal basis with the older models.
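The windowing step can be sketched as follows. This is a minimal illustration under my own assumptions, not the code that produced the tables: the function name is hypothetical, and a failure falling exactly on the last day of the window is counted as an event.

```python
def censor_at(durations, events, window_days):
    """Right-censor every record at window_days.

    Drives still under observation after window_days are treated as
    survivors observed for exactly window_days.
    """
    out_durations, out_events = [], []
    for t, e in zip(durations, events):
        if t > window_days:
            out_durations.append(window_days)
            out_events.append(0)        # whatever happened later is ignored
        else:
            out_durations.append(t)
            out_events.append(e)
    return out_durations, out_events

# One subset per window, all starting at time zero and increasing in width
durations = [3, 10, 40, 400]
events = [1, 1, 0, 1]
subsets = {w: censor_at(durations, events, w) for w in (7, 30, 60, 90)}
```

Because every window starts at time zero, the 7 day subset is contained in the 30 day subset and so on, which is why the separate analyses are not statistically independent.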

Methods:

KM curves can show artifacts when time under observation varies widely between classes, as it does in the Backblaze disk drive failure data, where some drives now have relatively short observation periods. To avoid these artifacts, we remove from each window any class with zero observations beyond the last day in the window. As a result, for the newest drives, the last window in which they appear offers an unbiased visualisation of their survival compared to other classes, using all available data that are comparable in terms of period under observation. These sub-window KM curves also allow other interesting questions to be addressed, such as how long a period of observation is enough to detect classes with particularly poor survival.
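One way to express the class filter is sketched below. The record layout and function name are assumptions for illustration, and "beyond the last day" is read literally, so a class is kept only if at least one of its drives was observed for strictly longer than the window.

```python
def classes_for_window(records, window_days):
    """Return the set of models to keep for a given window.

    records: iterable of (model, days_observed, failed) tuples
    """
    longest = {}
    for model, days, _failed in records:
        longest[model] = max(longest.get(model, 0), days)
    # keep a model only if some drive was observed beyond the window
    return {m for m, d in longest.items() if d > window_days}

# Hypothetical data: model B is too new to appear in the 30 day window
records = [("A", 400, 0), ("A", 20, 1), ("B", 25, 0), ("B", 30, 1)]
keep_7 = classes_for_window(records, 7)     # both models
keep_30 = classes_for_window(records, 30)   # model A only
```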

One other change is to determine whether each drive has a higher or lower failure rate than the average across all drive classes, designating a lower than average rate as negative and a higher than average rate as positive. Applying this sign to the individual chi-squared statistics allows us to rank the drives by a statistical measure of how much better or worse they are than the mean over all drives. A summary table at the end of this post gives the descriptive statistics (mean and variance) of each drive's rank over all of the periods tested.
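As a sketch of the signed ranking, the per-model statistic can be taken as (observed - expected)^2 / expected, which is consistent with the chisq column in the tables in this post. The function name and the dictionary inputs are assumptions made for the example; the three rows of input are copied from the 7 day table.

```python
def signed_chisq_ranks(observed, expected):
    """Rank models from best (most negative) to worst (most positive).

    observed, expected: dicts of failure counts keyed by model
    """
    signed = {}
    for model, o in observed.items():
        e = expected[model]
        chisq = (o - e) ** 2 / e
        # negative when fewer failures than expected, positive when more
        signed[model] = chisq if o > e else -chisq
    ordered = sorted(signed, key=signed.get)
    ranks = {model: i + 1 for i, model in enumerate(ordered)}
    return signed, ranks

observed = {"Hitachi HDS722020ALA330": 0, "ST4000DM000": 38, "WDC WD30EFRX": 10}
expected = {"Hitachi HDS722020ALA330": 5.0383700,
            "ST4000DM000": 38.6164413,
            "WDC WD30EFRX": 1.3895444}
signed, ranks = signed_chisq_ranks(observed, expected)
```

With only these three models the ranks are 1, 2 and 3; in the full table the same signed values place them 1st, 9th and 16th.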

Results:

It turns out that a lot of information about the final rank of each drive's failure risk is available after only a few weeks of observation, but that over time, different disk drive models can exhibit markedly different patterns. Some drive rankings undergo substantial shifts over long observation time windows. Some drives exhibited misleadingly large early failure rates, perhaps indicating poor manufacturer quality control, then settled down to consistently very low failure rates. Others performed well in the first weeks, but fell to outstandingly poor reliability levels after some time had elapsed. With those few exceptions, the top ranked 5 drives and the bottom ranked 5 drives tended to stay in the same tertile over time. The middling drives had very high rank variances as they bounced around from period to period, although they remained in the middle tertile much of the time. In the last table, the middling drives have higher rank variance values compared to the best and worst drives.

If someone knows where this has been done before, or can see why this is biased, please let me know.

First 7 days


The statistics below take the mean number of failures in all drives over the entire period and use that to calculate an "expected" failure count for each model based on the number of drive days of observation. They suggest that the Hitachi HDS722020ALA330 had the "best" (top ranked) performance, with 0 failures observed against 5.04 expected based on the number of drives and the overall mean failure rate over a week. The worst drive was the WDC WD30EFRX, with 10 failures in the week when only 1.4 would be expected. Large chi-squared statistics, such as those for each of the two extremes, suggest that the difference between what was observed and what could be expected of the average drive was unlikely to have occurred purely by chance. For some of the middle ranked drives, there is no evidence that they differed from the average drive, which makes sense since they are in the middle of the pack.



 *** KM statistics for first 7 days

                                                     groups  Freq Observed   Expected        chisq          sort rank
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774        0  5.0383700 5.038370e+00 -5.0383700236    1
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719        1  2.8691843 1.217715e+00 -1.2177154440    2
model=ST8000DM002                         model=ST8000DM002  9936        7 10.4775364 1.154208e+00 -1.1542083133    3
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048        0  1.1060351 1.106035e+00 -1.1060351455    4
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664        3  4.9221364 7.506107e-01 -0.7506106962    5
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642        7  9.1175125 4.917854e-01 -0.4917853679    6
model=ST3000DM001                         model=ST3000DM001  4707        4  4.9645325 1.873939e-01 -0.1873938724    7
model=ST8000NM0055                       model=ST8000NM0055  2460        2  2.5960854 1.368667e-01 -0.1368667432    8
model=ST4000DM000                         model=ST4000DM000 36611       38 38.6164413 9.840364e-03 -0.0098403644    9
model=ST6000DX000                         model=ST6000DX000  1937        2  2.0422247 8.730328e-04 -0.0008730328   10
model=ST500LM012 HN                     model=ST500LM012 HN   806        1  0.8497100 2.658212e-02  0.0265821183   11
model=WDC WD10EADS                       model=WDC WD10EADS   550        1  0.5803152 3.035167e-01  0.3035166758   12
model=ST31500541AS                       model=ST31500541AS  2188        4  2.3087384 1.238930e+00  1.2389301761   13
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       21 16.2921682 1.360389e+00  1.3603886486   14
model=ST31500341AS                       model=ST31500341AS   787        3  0.8294651 5.679831e+00  5.6798313991   15
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       10  1.3895444 5.335558e+01 53.3555763921   16
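One standard way to obtain expected counts like those in the table above is the log-rank calculation: on each day that a failure occurs, the failures are shared out among the models in proportion to the number of drives each still has under observation. The sketch below is an illustration under that assumption, not necessarily the exact procedure used for these tables, and the function name and record layout are hypothetical.

```python
from collections import defaultdict

def logrank_expected(records):
    """Return (observed, expected) failure counts keyed by group.

    records: list of (group, days_observed, failed) tuples
    """
    observed = {g: 0 for g, _days, _failed in records}
    expected = {g: 0.0 for g in observed}
    failure_days = sorted({days for _g, days, failed in records if failed})
    for t in failure_days:
        at_risk = defaultdict(int)
        failures = 0
        for g, days, failed in records:
            if days >= t:                   # still under observation on day t
                at_risk[g] += 1
                if failed and days == t:
                    failures += 1
                    observed[g] += 1
        total = sum(at_risk.values())
        for g, n in at_risk.items():
            # each group's share of the day's failures under equal risk
            expected[g] += failures * n / total
    return observed, expected

records = [("A", 1, 1), ("A", 3, 0), ("B", 2, 1), ("B", 3, 1)]
observed, expected = logrank_expected(records)
```

The expected counts always sum to the total number of observed failures, so a model can only look better than average if some other model looks worse. This simple version rescans every record for each failure day, which is fine for a sketch but slow on the full data.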





First 30 days

*** KM statistics for first 30 days
                                                     groups  Freq Observed   Expected        chisq         sort rank
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774        2  14.536034  10.81121264 -10.81121264    1
model=ST8000DM002                         model=ST8000DM002  9936       14  30.205722   8.69455863  -8.69455863    2
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       14  25.412999   5.12558757  -5.12558757    3
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719        3   8.273825   3.36159284  -3.36159284    4
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       31  41.803699   2.79209531  -2.79209531    5
model=ST8000NM0055                       model=ST8000NM0055  2460        2   3.825839   0.87136144  -0.87136144    6
model=ST500LM012 HN                     model=ST500LM012 HN   806        1   2.449486   0.85773453  -0.85773453    7
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048        2   3.191063   0.44456356  -0.44456356    8
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       12  14.187829   0.33737336  -0.33737336    9
model=ST4000DM000                         model=ST4000DM000 36611      110 111.271608   0.01453188  -0.01453188   10
model=ST6000DX000                         model=ST6000DX000  1937        7   5.884035   0.21165384   0.21165384   11
model=ST3000DM001                         model=ST3000DM001  4707       18  14.296240   0.95954192   0.95954192   12
model=WDC WD10EADS                       model=WDC WD10EADS   550        5   1.671228   6.63029058   6.63029058   13
model=ST31500541AS                       model=ST31500541AS  2188       26   6.633756  56.53681144  56.53681144   14
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       21   3.980120  72.78080001  72.78080001   15
model=ST31500341AS                       model=ST31500341AS   787       22   2.376519 162.03577702 162.03577702   16




First 60 days 

[1] *** KM statistics for first 60 days
                                                     groups  Freq Observed   Expected        chisq         sort rank
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       17  39.379147  12.71805620 -12.71805620    1
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774        7  23.545177  11.62628216 -11.62628216    2
model=ST8000DM002                         model=ST8000DM002  9936       28  48.844870   8.89568552  -8.89568552    3
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       39  62.474931   8.82069615  -8.82069615    4
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       13  22.971503   4.32844434  -4.32844434    5
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719        7  13.398179   3.05539241  -3.05539241    6
model=ST8000NM0055                       model=ST8000NM0055  2460        2   3.939151   0.95459802  -0.95459802    7
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048        3   5.164714   0.90730818  -0.90730818    8
model=ST500LM012 HN                     model=ST500LM012 HN   806        3   3.964394   0.23460233  -0.23460233    9
model=ST4000DM000                         model=ST4000DM000 36611      175 180.062299   0.14232222  -0.14232222   10
model=ST6000DX000                         model=ST6000DX000  1937        9   9.521104   0.02852082  -0.02852082   11
model=WDC WD10EADS                       model=WDC WD10EADS   550        8   2.699134  10.41044282  10.41044282   12
model=ST3000DM001                         model=ST3000DM001  4707       47  23.112913  24.68719309  24.68719309   13
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       27   6.410953  66.12259435  66.12259435   14
model=ST31500541AS                       model=ST31500541AS  2188       42  10.697858  91.59068684  91.59068684   15
model=ST31500341AS                       model=ST31500341AS   787       33   3.813675 223.36502391 223.36502391   16



First 90 days

[1] *** KM statistics for first 90 days
                                                     groups  Freq Observed   Expected       chisq        sort rank
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774        8  32.364762  18.3422215 -18.3422215    1
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       43  81.185631  17.9605969 -17.9605969    2
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       27  52.600241  12.4594932 -12.4594932    3
model=ST8000DM002                         model=ST8000DM002  9936       37  64.909950  12.0007075 -12.0007075    4
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       15  31.574779   8.7007192  -8.7007192    5
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719       10  18.407643   3.8401689  -3.8401689    6
model=ST500LM012 HN                     model=ST500LM012 HN   806        3   5.444532   1.0975663  -1.0975663    7
model=ST8000NM0055                       model=ST8000NM0055  2460        2   4.050161   1.0377761  -1.0377761    8
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048        5   7.096578   0.6194026  -0.6194026    9
model=ST6000DX000                         model=ST6000DX000  1937       11  13.077590   0.3300593  -0.3300593   10
model=ST4000DM000                         model=ST4000DM000 36611      242 247.290108   0.1131677  -0.1131677   11
model=WDC WD10EADS                       model=WDC WD10EADS   550        9   3.699493   7.5943869   7.5943869   12
model=ST3000DM001                         model=ST3000DM001  4707       70  31.690612  46.3105345  46.3105345   13
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       34   8.764192  72.6645468  72.6645468   14
model=ST31500541AS                       model=ST31500541AS  2188       60  14.649045 140.3988490 140.3988490   15
model=ST31500341AS                       model=ST31500341AS   787       46   5.194684 320.5341949 320.5341949   16





For the record, some more individual periods:

[1] *** KM statistics for first 120 days
                                                     groups  Freq Observed   Expected        chisq         sort rank
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       45 100.624668  30.74895800 -30.74895800    1
model=ST8000DM002                         model=ST8000DM002  9936       45  81.170497  16.11798467 -16.11798467    2
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774       16  41.502023  15.67039722 -15.67039722    3
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       16  40.492900  14.81499601 -14.81499601    4
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       35  66.314593  14.78714845 -14.78714845    5
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719       16  23.589115   2.44157809  -2.44157809    6
model=ST500LM012 HN                     model=ST500LM012 HN   806        3   6.979138   2.26869575  -2.26869575    7
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048        6   9.096398   1.05400830  -1.05400830    8
model=ST4000DM000                         model=ST4000DM000 36611      299 316.781484   0.99810502  -0.99810502    9
model=ST6000DX000                         model=ST6000DX000  1937       18  16.755360   0.09245569   0.09245569   10
model=WDC WD10EADS                       model=WDC WD10EADS   550       10   4.732107   5.86434385   5.86434385   11
model=ST3000DM001                         model=ST3000DM001  4707       92  40.503210  65.47430294  65.47430294   12
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       41  11.170640  79.65440579  79.65440579   13
model=ST31500541AS                       model=ST31500541AS  2188       85  18.684624 235.36620950 235.36620950   14
model=ST31500341AS                       model=ST31500341AS   787       58   6.603243 400.04986214 400.04986214   15
[1] *** KM statistics for first 360 days
                                                     groups  Freq Observed  Expected        chisq         sort rank
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       83 234.63083   97.9918497  -97.9918497    1
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       74 185.41048   66.9449458  -66.9449458    2
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       34 118.02503   59.8195618  -59.8195618    3
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774       45 120.72734   47.5006756  -47.5006756    4
model=ST4000DM000                         model=ST4000DM000 36611      757 911.36813   26.1469749  -26.1469749    5
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719       31  68.47720   20.5110673  -20.5110673    6
model=ST6000DX000                         model=ST6000DX000  1937       24  48.38267   12.2877622  -12.2877622    7
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048       14  26.46043    5.8677167   -5.8677167    8
model=ST500LM012 HN                     model=ST500LM012 HN   806        9  19.76770    5.8652927   -5.8652927    9
model=WDC WD10EADS                       model=WDC WD10EADS   550       17  13.66412    0.8144015    0.8144015   10
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321       77  31.36944   66.3750537   66.3750537   11
model=ST31500541AS                       model=ST31500541AS  2188      164  51.65516  244.3388268  244.3388268   12
model=ST31500341AS                       model=ST31500341AS   787      117  16.17115  628.6789204  628.6789204   13
model=ST3000DM001                         model=ST3000DM001  4707      513 112.89032 1418.0822443 1418.0822443   14
[1] *** KM statistics for first 720 days
                                                     groups  Freq Observed   Expected        chisq         sort rank
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642       92  477.24996  310.9848981 -310.9848981    1
model=HGST HMS5C4040BLE640       model=HGST HMS5C4040BLE640 15464       89  362.09015  205.9659138 -205.9659138    2
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664       64  308.22647  193.5153994 -193.5153994    3
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774       98  314.31556  148.8708418 -148.8708418    4
model=ST4000DM000                         model=ST4000DM000 36611     1337 1838.57536  136.8330339 -136.8330339    5
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719       53  178.45503   88.1956924  -88.1956924    6
model=ST6000DX000                         model=ST6000DX000  1937       40  125.38520   58.1458758  -58.1458758    7
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048       34   68.73314   17.5518130  -17.5518130    8
model=ST500LM012 HN                     model=ST500LM012 HN   806       29   36.14162    1.4111919   -1.4111919    9
model=WDC WD10EADS                       model=WDC WD10EADS   550       40   34.66398    0.8214041    0.8214041   10
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321      132   77.35618   38.5999815   38.5999815   11
model=ST31500541AS                       model=ST31500541AS  2188      331  125.25201  337.9764981  337.9764981   12
model=ST31500341AS                       model=ST31500341AS   787      205   30.81223  984.7187021  984.7187021   13
model=ST3000DM001                         model=ST3000DM001  4707     1650  216.74311 9477.6960925 9477.6960925   14
[1] *** KM statistics for first 1100 days
                                                     groups  Freq Observed   Expected      chisq        sort rank
model=HGST HMS5C4040ALE640       model=HGST HMS5C4040ALE640  8642      107  403.71762 218.076548 -218.076548    1
model=Hitachi HDS5C3030ALA630 model=Hitachi HDS5C3030ALA630  4664      105  297.57840 124.627462 -124.627462    2
model=Hitachi HDS5C4040ALE630 model=Hitachi HDS5C4040ALE630  2719       72  172.40912  58.477135  -58.477135    3
model=Hitachi HDS722020ALA330 model=Hitachi HDS722020ALA330  4774      197  300.86066  35.853931  -35.853931    4
model=Hitachi HDS723030ALA640 model=Hitachi HDS723030ALA640  1048       47   65.97533   5.457544   -5.457544    5
model=WDC WD10EADS                       model=WDC WD10EADS   550       58   29.46460  27.635517   27.635517    6
model=ST4000DM000                         model=ST4000DM000 36611     1747 1444.93368  63.147576   63.147576    7
model=WDC WD30EFRX                       model=WDC WD30EFRX  1321      152   67.16620 107.148752  107.148752    8
model=ST31500541AS                       model=ST31500541AS  2188      392   94.89439 930.210371  930.210371    9



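Here is a summary of the rank of each model in each time window, reflecting the extent to which it performed better or worse than the average, together with the mean and variance of those ranks. The mean and var values shown are consistent with being computed over the windows up to day 720, ignoring NA entries.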

                     Model Rank_day_2 Rank_day_7 Rank_day_15 Rank_day_30 Rank_day_60 Rank_day_90 Rank_day_120 Rank_day_360 Rank_day_720 Rank_day_1100      mean        var
5  Hitachi HDS722020ALA330          2          1           1           1           2           1            3            4            4             5  2.111111  1.6111111
1     HGST HMS5C4040ALE640          1          6           5           3           1           3            5            2            1             1  3.000000  3.7500000
4  Hitachi HDS5C4040ALE630          4          2           3           4           6           6            6            6            6             4  4.777778  2.4444444
3  Hitachi HDS5C3030ALA630          3          5           7           9           5           5            4            3            3             3  4.888889  4.1111111
13             ST8000DM002         11          3           2           2           3           4            2           10           10            NA  5.222222 15.1944444
2     HGST HMS5C4040BLE640         15         14           8           5           4           2            1            1            2             2  5.777778 29.4444444
6  Hitachi HDS723030ALA640          8          4           4           8           8           9            8            8            8             9  7.222222  3.4444444
14            ST8000NM0055          5          8           6           6           7           8           11           11           11            NA  8.111111  5.6111111
11           ST500LM012 HN          9         11           9           7           9           7            7            9            9            NA  8.555556  1.7777778
10             ST4000DM000         13          9          13          10          10          11            9            5            5            NA  9.444444  8.5277778
12             ST6000DX000         12         10          10          11          11          10           10            7            7            NA  9.777778  2.9444444
7              ST3000DM001          7          7          12          12          13          13           12           14           14             7 11.555556  7.2777778
15            WDC WD10EADS         10         12          11          13          12          12           13           NA           NA            NA 11.857143  1.1428571
9             ST31500541AS          6         13          15          14          15          15           14           12           12             8 12.888889  8.1111111
8             ST31500341AS         14         15          14          16          16          16           15           13           13             6 14.666667  1.5000000
16            WDC WD30EFRX         16         16          16          15          14          14           NA           NA           NA            NA 15.166667  0.9666667
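The mean and var columns can be reproduced with a few lines of code. This is a sketch rather than the original code: the function name is hypothetical, NA is represented as None, and the input ranks are copied from two rows of the table above for the windows up to day 720, which is an assumption about how the columns were computed.

```python
from statistics import mean, variance

def rank_summary(ranks):
    """Mean and sample variance of a model's ranks, skipping missing windows.

    ranks: list of per-window ranks, with None where the model was excluded
    """
    present = [r for r in ranks if r is not None]
    return mean(present), variance(present)

# Ranks for day 2 through day 720, taken from the summary table
hitachi = rank_summary([2, 1, 1, 1, 2, 1, 3, 4, 4])
wd10eads = rank_summary([10, 12, 11, 13, 12, 12, 13, None, None])
```

A low variance means a model held roughly the same position in every window, while a high variance flags models whose early weeks gave a poor guide to their later ranking.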

