Survival analysis of hard disk drive failure data.


Ross Lazarus, February 2016


Executive Summary:


Using a well established, objective analysis and data presentation method designed for right censored data provides insights into hard disk drive failure which are not provided by simple descriptive statistics or charts. Kaplan-Meier statistics and plots are recommended for routine use with hard drive failure data, and their use is illustrated with 30M data points from the Backblaze public data.

Introduction:


Hard disk drives are widely used for mass storage in servers, network attached storage devices, laptops and desktop computers. Familiar and convenient as they are, these complex electro-mechanical devices are prone to sudden catastrophic failure, which can lead to very unpleasant consequences such as loss of data which was not securely backed up elsewhere. Selecting drive manufacturers and models for home or for commercial applications is complicated by the problem that objective and reliable measurements of the reliability of specific drive models or manufacturers are hard to find.

Subjective experience of individual consumers who purchase a few drives at a time is readily available in on-line product reviews at the larger retailers like Newegg or Amazon. These reviews are likely to be biased by negative reviews from those unlucky owners of a drive which happened to fail quickly - satisfied owners are less likely to take the time to share their experiences compared to unhappy owners who have just lost precious data.

Large commercial purchasers such as Google or Amazon probably do their own in-house testing, but rarely share their hard won findings or raw data. As drive capacities grow, new models are released on a regular basis but it takes at least 2 or 3 years of observation of a large number of sample drives under typical field operation conditions before robust conclusions can be drawn on the reliability over time for each new model.

The most recent analysis of about 50,000 hard disks deployed over nearly 3 years in a commercial online storage facility run by Backblaze is one of the largest published studies and can be viewed at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/ and in other Backblaze blogs. Simple statistics, tables and bar graphs derived from 30,301,566 observations are presented and discussed. That's a lot of data and the Backblaze engineers have done their best to make sense of it. Unfortunately I'm not sure that you can really see what's going on from their presentation. For example, time is split into 3 year-long intervals in the main table, which makes the trends over time confusing and hard to follow, and the summary bar charts hide an awful lot of interesting detail.

Failure time (or survival) analysis:


Part of the challenge in interpreting this type of data is the problem that at any point in time during the observation period, one or more drives (or patients or more generally, units of analysis) may fail, and one or more drives may be removed from any further study before failure because of firmware diagnostics or planned maintenance. In terms of statistical analysis, this problem is termed right censoring because no further information is available after a drive is removed. Right censoring must be taken into account in order to calculate the failure rate correctly: at each point in time, the denominator must include only the drives still under observation, excluding both the drives which have already failed and the drives which were removed before they failed.

Epidemiologists and statisticians have established valid and robust methods for handling right censored data in the context of survival analysis, which are applicable to the Backblaze data. The survival fraction is the complement of the failure fraction, so survival and failure analysis are mathematically equivalent, being two sides of the same technical coin, although failure time analysis predominates in engineering circles whereas the survival analysis paradigm predominates in biology.

One popular method is the Kaplan-Meier (KM) plot and KM statistics, widely used to compare (for example) survival time after diagnosis for patients with the same cancer but different treatments. This kind of data is similar to the hard drive failure data because the reality is that it is almost inevitable that some patients in any clinical study will be lost to further follow up after a visit at which they were clearly alive. Those right censored patients, like the drives removed before failure, contribute no more information to the study, but do contribute useful information for the whole time they are being observed. Some details on where the data came from and how the analysis was performed are provided at the end of this article.

Application of survival analysis to hard disk drive failure data:


Here's a KM plot showing the survival of each drive by the manufacturer.



The vertical axis represents the fraction of drives which survived at any given point in time and the horizontal axis represents days since time zero. Each individual disk drive's history over time is "lined up" so the first day of observation is always at the far left, at time zero - like a race where each competitor starts at the same point, although in the raw data, drives were introduced to the pool continuously over the entire study period. Each manufacturer's drives are grouped together and their survival in service over time is plotted as a single line. Each downward step in a line represents one or more failures at that time. Each cross on a line represents a right censored observation removed from further study. Note that right censoring does not itself produce a step in the curve - it simply reduces the denominator (the number of drives still under observation) used for subsequent failure or conversely, survival rate calculations.
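The way each curve is built can be sketched in a few lines of Python. This is only an illustrative sketch of the standard Kaplan-Meier product-limit calculation, not the R code used for the plots here; the function name and the toy data are made up for the example.

```python
from collections import Counter

def kaplan_meier(durations, failed):
    """Return [(day, surviving_fraction)] for each day with at least one failure.

    durations: days each drive was under observation
    failed:    True if the drive failed, False if it was right censored
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    removals = Counter(durations)   # failed and censored drives both leave the pool
    at_risk = len(durations)        # drives still under observation
    survival = 1.0
    curve = []
    for t in sorted(removals):
        d = failures.get(t, 0)
        if d:
            # a downward step: d of the at_risk drives failed on day t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # censored drives produce no step, they only shrink the denominator
        at_risk -= removals[t]
    return curve

# Hypothetical toy data: six drives, three failures, three censored
days = [100, 200, 200, 300, 400, 500]
failed = [True, False, True, False, True, False]
curve = kaplan_meier(days, failed)
```

With this toy data the curve steps down on days 100, 200 and 400. The drive censored on day 300 causes no step, but it does mean the failure on day 400 is counted against two remaining drives rather than three.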

Here is the Backblaze summary chart, "hard drive reliability by manufacturer", linked from their report at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/

To my eye, the KM curve provides a much more detailed and arguably more accurate summary of what happened during the observation period. Note the curve for ST500LM012, which is an obvious anomaly arising from an aberrant manufacturer string in the data ("ST500LM012 HN") where the two space delimited components in the data field are reversed (see below) compared to the majority of the data where the model follows the manufacturer abbreviation. This does not seem to have been noticed in the Backblaze analysis but the KM plot makes it obvious. No attempt has been made to correct this anomaly because it is not clear whether the model number means that the "HN" is wrong and should be replaced by "ST" - I'll leave that for the Backblaze engineers to figure out and fix!

One example of a feature that was not at all obvious from the Backblaze analysis, but is clear from the KM plot, is the crossover in failure rate between ST (Seagate) and WDC (Western Digital). Initially, the WDC family failed slightly faster but the Seagate family of samples failed more quickly after about the first year of operation.

The log-rank test which accompanies the KM analysis estimates the number of failures expected for each manufacturer from the pooled failure rate and the number of units under observation at each time point. As shown below, it suggests that drive survival is significantly different between manufacturers, with some (e.g. HGST) having far fewer observed failures than expected and others (e.g. ST) having far more than expected. The global chi-squared value of 2535 is extremely unlikely to have arisen by chance alone:

                        N Observed Expected (O-E)^2/E (O-E)^2/V
manufact=HGST       10424      100   515.21  3.35e+02  4.08e+02
manufact=Hitachi    13244      385  1533.11  8.60e+02  1.53e+03
manufact=ST         32714     3266  1798.14  1.20e+03  2.21e+03
manufact=ST500LM012   377       22     8.89  1.93e+01  1.94e+01
manufact=TOSHIBA      254        9     9.15  2.59e-03  2.59e-03
manufact=WDC         3753      298   215.49  3.16e+01  3.34e+01

 Chisq= 2535  on 5 degrees of freedom, p= 0 
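For readers curious where the Observed and Expected columns come from, here is a rough Python sketch of the usual log-rank bookkeeping. It is an assumption about how those two columns are formed, not the R code used above: at every day with failures, the failures are shared out between groups in proportion to the number of drives each group still has under observation. The function name and toy records are hypothetical, and the Chisq value reported by R also needs the covariance matrix of the differences, which is left out here.

```python
from collections import defaultdict

def observed_expected(records):
    """records: iterable of (group, days_observed, failed) tuples.

    Returns {group: (observed_failures, expected_failures)}.
    """
    records = list(records)
    at_risk = defaultdict(int)          # drives still under observation, per group
    for group, _, _ in records:
        at_risk[group] += 1
    by_day = defaultdict(list)
    for group, day, failed in records:
        by_day[day].append((group, failed))

    observed = defaultdict(float)
    expected = defaultdict(float)
    for day in sorted(by_day):
        total = sum(at_risk.values())
        failures = sum(1 for _, failed in by_day[day] if failed)
        if failures:
            # share the day's failures out in proportion to drives at risk
            for group, n in at_risk.items():
                expected[group] += failures * n / total
        for group, failed in by_day[day]:
            if failed:
                observed[group] += 1
            at_risk[group] -= 1         # failed or censored, the drive leaves the pool
    return {g: (observed[g], expected[g]) for g in at_risk}

# Hypothetical toy data: two groups of three drives each
toy = [("A", 1, True), ("A", 2, True), ("A", 3, False),
       ("B", 2, True), ("B", 4, False), ("B", 5, False)]
result = observed_expected(toy)
```

In the toy data group A has 2 observed failures against 1.3 expected, and group B has 1 against 1.7. The (O-E)^2/E column in the table is then simple arithmetic on each row; for HGST, (100 - 515.21)^2 / 515.21 is about 335.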

The KM plot pattern seems much easier to understand and is not at all obvious from the table or bar graphs shown in the original article.

For individual drive models, the KM curves are complex but even more revealing:


The KM curves show that one particular Seagate model failed at an unusually high rate over the entire period, whereas the curves at the top of the plot show a group of very reliable drive models which had very few failures over the entire period of observation. These individual drive model curves are made from the same data as the manufacturer curves but reveal a great deal of interesting variation within each manufacturer's offerings - again suggesting that descriptive and summary statistics presented in the Backblaze blogs hide a lot of important and interesting complexity.

Again, the KM statistics show that the differences between models seen in the KM plot are statistically significant and unlikely to have arisen by chance alone.

                                  N Observed Expected (O-E)^2/E (O-E)^2/V
model=HGST HMS5C4040ALE640     7168       73    285.7  1.58e+02  1.83e+02
model=HGST HMS5C4040BLE640     3115       21    194.6  1.55e+02  1.67e+02
model=Hitachi HDS5C3030ALA630  4662       98    519.4  3.42e+02  4.09e+02
model=Hitachi HDS5C4040ALE630  2719       63    298.9  1.86e+02  2.05e+02
model=Hitachi HDS722020ALA330  4774      175    530.7  2.38e+02  2.86e+02
model=Hitachi HDS723030ALA640  1048       45    115.3  4.29e+01  4.45e+01
model=ST3000DM001              4707     1705    305.5  6.41e+03  7.06e+03
model=ST31500341AS              787      216     45.1  6.47e+02  6.55e+02
model=ST31500541AS             2188      392    199.1  1.87e+02  1.98e+02
model=ST4000DM000             21671      695   1025.8  1.07e+02  1.52e+02
model=ST6000DX000              1906       26     27.6  9.20e-02  9.51e-02
model=WDC WD10EADS              550       53     54.7  5.38e-02  5.47e-02
model=WDC WD30EFRX             1267      114     73.6  2.22e+01  2.27e+01

 Chisq= 8587  on 12 degrees of freedom, p= 0 

More complex models:

The KM plot is a robust, non-parametric method which is attractive because of the lack of assumptions about the data. More sophisticated methods such as Cox proportional hazards models require distributional or other assumptions, but allow adjustment for additional variables such as the kind of storage pod (see the Backblaze blogs), drive capacity, number of platters or other factors of interest. My view is that this is not going to be at all useful until a lot more data becomes available. 

Conclusions:

Other than as a consumer, I don't have any particular expertise on hard disk drives but I have made a successful career out of interpreting large scale data sets using appropriate statistical methods. I find the KM analysis much clearer and easier to interpret than the simple descriptive statistics presented by Backblaze and I hope they use more appropriate methods going forward. I'm happy to help if anyone cares to ask.

Technical details and data source:

The Backblaze folk have done a great service to the community by making their data freely available for anyone willing to poke at it at https://www.backblaze.com/hard-drive-test-data.html. The data release which includes the third quarter of 2015 was downloaded in early February 2016 and is reported here.

Here's a small sample of the 30,301,566 rows of raw data available from Backblaze. There's a separate CSV format file for each day of each year. These are stored in three directories, one per year (e.g. 2013). This is from the start of "2013/2013-04-10.csv"


date serial_number model capacity_bytes failure
2013-04-10 MJ0351YNG9Z0XA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNG9WJSA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNG9Z7LA Hitachi HDS5C3030ALA630 3000592982016 0
2013-04-10 MJ0351YNGAD37A Hitachi HDS5C3030ALA630 3000592982016 0

Since I don't trust the smartdrive stats, I threw all those columns away and split out the manufacturer code and model from the "model" field.
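The split of the "model" field might look something like the Python sketch below. The rule is my guess, inferred from the tables above rather than taken from the original script: when the field contains a space the first token is treated as the manufacturer, otherwise the leading letters are used (so "ST3000DM001" gives "ST"). The function name is hypothetical.

```python
import re

def split_model(model_field):
    """Split a raw 'model' value into (manufacturer_code, model).

    Assumed rule: 'Hitachi HDS5C3030ALA630' -> ('Hitachi', 'HDS5C3030ALA630'),
    while a value with no space keeps the whole string as the model and
    uses its leading letters as the manufacturer code.
    """
    parts = model_field.split()
    if len(parts) >= 2:
        return parts[0], " ".join(parts[1:])
    match = re.match(r"[A-Za-z]+", model_field)
    prefix = match.group(0) if match else ""
    return prefix, model_field

print(split_model("Hitachi HDS5C3030ALA630"))
print(split_model("ST3000DM001"))
print(split_model("ST500LM012 HN"))
```

A rule like this also reproduces the anomaly discussed earlier: "ST500LM012 HN" ends up with "ST500LM012" as its manufacturer code.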

The Kaplan-Meier plot and test statistics are available in most worthwhile statistical packages and I used the npsurv function from the R survival package for the plots and statistics reported here. In order to improve the reliability of the model curves, drives with fewer than 500 observations were dropped.

A Python script was used to read all the files, keeping track of the appearance and disappearance of each unique drive as defined by a combination of model and serial_number, while processing each day's data in sequence. No database needed - Python easily handles this data as an in-memory dictionary, after dropping all the smartdrive columns. After reading all 30 million rows, a summary file was written containing a single row for each unique drive with the date it first appeared, the number of days it was under observation and a code indicating whether it failed or not. That script processed about 30,000 CSV rows a second on my oldish desktop, taking about 17 minutes for the entire dataset. The R script takes only a few seconds to perform the KM analysis and generate plots.
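The reduction described above can be sketched roughly as follows. This is not the original script, just a minimal illustration of the same idea using an in-memory dictionary; the function name, the made-up serial numbers and the choice to count days as last date minus first date plus one are all assumptions.

```python
import csv
import io
from datetime import date

def summarise(daily_rows):
    """Collapse daily rows (in date order) into one record per unique drive.

    Returns {(model, serial_number): (first_date, days_observed, failed)}.
    """
    drives = {}
    for row in daily_rows:
        key = (row["model"], row["serial_number"])
        day = date.fromisoformat(row["date"])
        rec = drives.setdefault(key, {"first": day, "last": day, "failed": 0})
        rec["last"] = day
        if row["failure"] == "1":
            rec["failed"] = 1
    return {
        key: (rec["first"].isoformat(),
              (rec["last"] - rec["first"]).days + 1,   # assumed definition of days observed
              rec["failed"])
        for key, rec in drives.items()
    }

# Hypothetical rows standing in for the concatenated daily CSV files
sample = io.StringIO(
    "date,serial_number,model,capacity_bytes,failure\n"
    "2013-04-10,AAA,Hitachi HDS5C3030ALA630,3000592982016,0\n"
    "2013-04-10,BBB,ST3000DM001,3000592982016,0\n"
    "2013-04-11,AAA,Hitachi HDS5C3030ALA630,3000592982016,0\n"
    "2013-04-11,BBB,ST3000DM001,3000592982016,1\n"
    "2013-04-12,AAA,Hitachi HDS5C3030ALA630,3000592982016,0\n"
)
summary = summarise(csv.DictReader(sample))
```

Each drive ends up as one (first date, days observed, failed) record, which is the shape of input that a KM analysis needs: a duration plus a flag saying whether the drive failed or was right censored.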



Comments

  1. I'm interested in this phrase:
    "Since I don't trust the smartdrive stats, I threw all those columns away"

    Why not? Is there good statistical basis not to trust them? I generally find the uncorrectable error rates and realloc sectors to be good (if fuzzy) indicators of failure, but most of my data points are based on the spectacularly unreliable Seagate Barracudas.

  2. Thanks for the comment.
    Three reasons why the smartdrive stats are not used here:

    1) I'm not smart enough to understand how they can be used to provide useful insight into this data
    2) I wanted the simplest models that provide interesting insight. Cox and other models incorporating continuous covariates are much harder to interpret
    3) I read some discouraging comments about the smartdrive stats in some of the backblaze data blogs.
    4) The smartdrive stats have far too much missing for me to trust them or to want to fool around with imputation to make KM or Cox models possible.

    Please feel free to share your own findings if they help make the data easier to understand.

  3. Thank you for providing this analysis. I find the legends for the plot very hard to use. I think the usability of this information could be improved by adding a label on each of the lines.

    Replies
    1. I also found the legend very difficult to read, and the colours too similar. I've just discovered your interesting post due to going through another research/buying cycle for a couple of new drives.

  4. There's a legend which works for me but please feel free to send me a pull request to improve the plots so they're labelled the way you prefer. Source at https://github.com/fubar2/backblazeKM

    Replies
    1. GGally::ggsurv can add texts along the curves, which would be more readable for large plots IMHO. Thanks for your analysis.

  5. Thank you for this interesting insight!

    I fully agree that your analysis is much easier to interpret and reveals more detail than the original analysis done by Backblaze. I have never used or seen KM-plots but it seems to be a very handy tool.

    Thanks again and thumbs up.

  6. For Nathan and other people like me who struggle to distinguish the colours in the models plot: https://dl.dropboxusercontent.com/u/242368/km_model_feb2015_rl.png (I only spent time on the top survivors)

    Replies
    1. Did you see the updates at http://bioinformare.blogspot.com.au/2016/05/survival-analysis-of-hard-disk-drive.html ? I reran the scripts with the Q1 2016 data added. More data = more reliable estimates.

  7. Oh! and thanks for your effort and mostly the fine idea of using KM-plots for such cases Lazarus. Vastly better than the simple statistics we usually see.

  8. Examining the nations of the brands is interesting. The Japanese brands perform better than the USA brands. So WD bought HGST, the best performer. HGST is now totally owned in every way now by WD, so will the worst performer now and the best performer move towards the mean?

    Outsiders like myself are wondering if and when South Korea and China will enter these charts. Unfortunately these charts do not cover the nations of manufacture of the products, ... yet.

    Ownership and brand-origin of the brands seem to show patterns in the above charts. I am guessing that all items are made in factories based in East Asia, including Thailand, Singapore, Vietnam & China? Perhaps the nation of final assembly of the metal units might show interesting patterns?

    In the developed nations like Australia (where we live now), USA, etc have lost most of our factory creativity. Will East Asia be able to better our abilities?

  9. I can say as an end user I agree more with these graphical representations than Backblaze's. Seagates are absolute garbage, lost everything I had

  10. SMART was done to hide data rather than make it public. In the old days, when you bought drives they came with a defect list which you manually added into a bad sector table. Drive manufacturers realized having a list of how many bad spots were on their products was not a good marketing move. So SMART was born to hide them. It was spun to give "early" warning. But I have never once seen a SMART alert warn of a drive failure. But plenty proclaim a drive fine that had problems. I have been sysadmin and network admin for more years than I care to admit so have seen plenty of drives (nowhere near as many as Backblaze but still enough)

    ReplyDelete
  11. Ross - really interesting analysis. I'll see if I can teach myself some R and replicate your work. I'm a geophysicist with a decent maths background so should be able to get into it.
    Quick question - can the plots be altered so the symbols (+) colour matches the line colour?

    it'll be interesting to see how the new 8TB Seagate drives behave. First decent numbers just in and the infant mortality rates are showing promise.

    Replies
    1. Any chance seeing the K-M analysis including the 2016 data?

  12. SVG graphs would make this a lot easier to read :|

  13. Thanks for sharing the article. Keep sharing more with us.
  14. Although I find this article fascinating, from a layman's viewpoint, could someone summarize please? I don't buy drives in vast quantities but as a consumer, I am interested in statistics that would indicate which brand of drive would give me the longest trouble-free lifetime of use.

    I believe that most consumers have experienced data loss at some point from drive failure (I certainly have) and even after researching anecdotal evidence and forum posts, I'm nowhere closer to knowing concretely (within reason) which drive I should put my faith in... and let's be honest about it, most consumers purchase on price alone, some will decide on performance and/or features. Many simply go with review recommendations which are vague at best so I say "faith" as in you're rolling the dice when you buy as opposed to making an informed decision.

    It seems from this analysis that Seagate tends to have a significantly higher failure rate than HGST, am I correct in this?

    Thank you for your time and thanks to the author for the fascinating analysis.

  15. Very nice post, impressive. its quite different from other posts. Thanks for sharing.
  16. Awesome examination of their data!
    Thanks for posting it!

  17. Bruno (and other laymen) -

    If there was a hard drive in the chart that had ZERO failures, then it would be easy to spot which brand/model to purchase in order to get the longest possible life. However, since a hard drive is a complex electro-mechanical component, you could purchase the best performing hard drive on the chart, and experience a failure the day it's installed, and that failure could either be mechanical or electrical, or even logical.

    The best you can do is purchase a hard drive that will meet your anticipated storage needs for the long haul, not necessarily the largest hard drive since those are usually fairly new to market and not fully field tested, and make sure that it comes with the longest warranty that you can afford, AND BACKUP your hard drive OFTEN. Enterprise hard drives tend to have the longest warranty and the highest price.

    In defense of Seagate, we have found that their "Enterprise-level hard drives" are the most reliable, but we purchase mostly 2TBs and smaller, with the occasional 4TB or 6TB; and Hitachi/HGST runs a close second in reliability.

    Don't be fooled into thinking that Solid State Drives/Flash Storage Cards are the answer either, we have found that those fail almost as often as electro-mechanical hard drives. When a Solid State Drive fails, it's also very difficult to recover the data from, if you have not backed it up.

    The best answer for whatever type, brand, or model hard drive you purchase is BACKUP OFTEN. A drive can be replaced, your important files, pictures, etc. can't.

    Hope this helps!
    Tim
    CORNICE
    Apple Authorized Service Provider


  18. I really like your blog. You have shared the whole concept really well and it is very beautifully written.
    A soulful read! Thanks for sharing.


  19. Excellent analysis! Obviously a lot of prep and forethought required to produce this high level of reporting.

    If someone can do it better, have at it.

    Semper vigilis.

    State College, Pennsylvania

    Solar Eclipse due on August 21.

    See MrEclipse.com,

    and get your pinhole cameras made well

    In advance.

  20. Good blog and very useful blog for service people. Hard disk issue can cause boot disk failure error and system might not work. We are into Laptop service in Chennai and hard disk issues are very common.

  21. Thank you very much for sharing such a beautiful article.

