Survival analysis of hard disk drive failure data: Update to Q1 2016

May 18, 2016

Ross Lazarus, May 2016

This is an update to http://bioinformare.blogspot.com.au/2016/02/survival-analysis-of-hard-disk-drive.html now that additional data for Q1 2016 has been released from https://www.backblaze.com/b2/hard-drive-test-data.html.
I reran my scripts and got the plots shown below. The whole process takes only a few minutes.

For me, the interesting thing is that so little really changes in the KM curves and statistics with 10% more data, suggesting that this statistical approach is reliable and robust, although in general we expect that more data provides better resolution.

With more data, the WD30-EFRX and WD10-EADS drives are reordered in terms of failure risk, down near the middle of the pack, but the updated model KM curves otherwise suggest the same pattern of risk of failure over time. Hitachi and HGST have reversed their positions at the top of the manufacturer survival curves as a result of the additional data, but the other manufacturers remain largely unchanged.
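
For readers unfamiliar with how a KM curve is built, here is a minimal sketch in Python of the Kaplan-Meier estimator. It is not the R code used for this post; the function name `kaplan_meier` and the toy numbers are assumptions made up for illustration.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: time each drive was observed (for example, days in service)
    failed:    True if the drive failed at that time, False if it was
               censored (still running, or withdrawn, when observation ended)
    """
    deaths = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures plus censorings leaving the risk set
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Survival drops only at times where at least one failure occurs.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # Everyone who failed or was censored at t leaves the risk set.
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: six drives, three failures, three censored.
toy_durations = [100, 200, 200, 300, 400, 500]
toy_failed = [True, False, True, True, False, False]
for t, s in kaplan_meier(toy_durations, toy_failed):
    print(t, round(s, 4))
```

In the toy data the last two drives are censored, so the estimate does not drop after the final failure at 300: censored observations shrink the risk set but never lower the curve.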

In terms of the KM statistical tests, additional data confirms the earlier inference that there are significant differences between the manufacturer and model risk profiles over time.
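
As a rough illustration of the kind of test involved, the sketch below computes a log-rank statistic for the simplest case of two groups, in Python rather than the R used for this post. The function name `logrank_two_groups` and the toy data are hypothetical, and comparisons of more than two groups need the full covariance matrix of the observed minus expected counts rather than a single variance.

```python
def logrank_two_groups(durations, failed, group):
    """Two-group log-rank test; returns (O1, E1, V, chisq) for group 1.

    durations: observation time per drive
    failed:    True for a failure, False for a censored observation
    group:     1 or 0 per drive
    """
    failure_times = sorted({t for t, f in zip(durations, failed) if f})
    o1 = e1 = v = 0.0
    for t in failure_times:
        # Risk sets just before time t.
        n = sum(1 for x in durations if x >= t)
        n1 = sum(1 for x, g in zip(durations, group) if x >= t and g == 1)
        # Failures at time t, overall and in group 1.
        d = sum(1 for x, f in zip(durations, failed) if x == t and f)
        d1 = sum(1 for x, f, g in zip(durations, failed, group)
                 if x == t and f and g == 1)
        o1 += d1
        e1 += d * n1 / n  # expected failures in group 1 if the groups do not differ
        if n > 1:
            # Hypergeometric variance of d1 at this failure time.
            v += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chisq = (o1 - e1) ** 2 / v if v > 0 else 0.0
    return o1, e1, v, chisq


# Hypothetical toy data: group 1 fails early, group 0 fails late.
print(logrank_two_groups([1, 2, 3, 4], [True, True, True, True], [1, 1, 0, 0]))
```

With two groups, the returned `chisq` is (O-E)^2/V on 1 degree of freedom.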

Call:
survdiff(formula = sm ~ model, data = dm, rho = 0)

                                   N Observed Expected (O-E)^2/E (O-E)^2/V
model=HGST HMS5C4040ALE640      7168       83    473.2    321.78    376.18
model=HGST HMS5C4040BLE640      3115       21    231.4    191.29    205.14
model=Hitachi HDS5C3030ALA630   4664      106    458.0    270.51    313.62
model=Hitachi HDS5C4040ALE630   2719       70    263.4    141.98    153.89
model=Hitachi HDS722020ALA330   4774      195    466.4    157.94    183.39
model=Hitachi HDS723030ALA640   1048       47    101.7     29.42     30.35
model=ST3000DM001               4707     1705    258.4   8100.00   8753.25
model=ST31500341AS               787      216     37.8    839.35    848.18
model=ST31500541AS              2188      392    166.0    307.66    321.95
model=ST4000DM000              35858      895   1302.4    127.45    195.39
model=ST500LM012 HN              656       24     17.2      2.70      2.71
model=ST6000DX000               1909       26     57.4     17.19     17.76
model=WDC WD10EADS               550       59     47.5      2.78      2.81
model=WDC WD30EFRX              1280      124     82.2     21.26     21.72

Chisq= 10647 on 13 degrees of freedom, p= 0

Call:
survdiff(formula = s ~ manufact, data = ds, rho = 0)

                       N Observed Expected (O-E)^2/E (O-E)^2/V
manufact=HGST      10449      110    740.5   536.857   668.750
manufact=Hitachi   13246      422   1380.6   665.600  1060.350
manufact=HN          656       24     18.2     1.885     1.896
manufact=ST        46909     3507   2032.7  1069.209  2056.135
manufact=TOSHIBA     255       10     11.3     0.154     0.155
manufact=WDC        3838      342    231.7    52.560    55.539

Chisq= 2420 on 5 degrees of freedom, p= 0
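
The (O-E)^2/E column can be checked by hand from the Observed and Expected columns. The short Python sketch below does that arithmetic for the manufacturer table; the helper name `o_minus_e_sq_over_e` is made up for illustration, and because the printed Expected values are rounded to one decimal place the results only approximately match the printed column.

```python
# (observed failures, expected failures) copied from the manufacturer table.
manufacturers = {
    "HGST": (110, 740.5),
    "Hitachi": (422, 1380.6),
    "HN": (24, 18.2),
    "ST": (3507, 2032.7),
    "TOSHIBA": (10, 11.3),
    "WDC": (342, 231.7),
}


def o_minus_e_sq_over_e(observed, expected):
    """Squared gap between observed and expected failures, scaled by expected."""
    return (observed - expected) ** 2 / expected


for name, (obs, exp) in manufacturers.items():
    # An observed/expected ratio below 1 means fewer failures than expected
    # if all manufacturers shared one risk profile; above 1 means more.
    print(name, round(obs / exp, 2), round(o_minus_e_sq_over_e(obs, exp), 2))
```

A large value in this column says only that a group is far from its expected count. The direction comes from comparing Observed with Expected: HGST and Hitachi have far fewer failures than expected, ST far more.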

Here are the updated curves:

Comments

Unknown, May 19, 2016 at 8:04 AM
Your approach is only reliable and robust if you have actual experience in the HDD segment and understand how flawed the data set from Backblaze is; I can list the reasons if you are interested. Also, comparing a sample size of 20,000 or 30,000 drives to a sample size of either 250 (Toshiba) or 2-3K (WD) is terrible practice for "reliable" statistical results. (I really wanted to use some harsher words here, but I will not.)

tymbrimi, May 20, 2016 at 1:16 PM
Why do some of the curves level out to flat at the end, even though there seem to be data points indicating failures in the flat portions?

Unknown, May 24, 2016 at 8:29 AM
Hello, sorry for my ignorance, what is that V in (O-E)^2/V? Are the expected values calculated only from the Backblaze dataset? What can we understand from the high normalized (O-E)^2/E values? That (for example) HGST drives have unpredictable failure rates? What's the meaning of Chisquared (on n degrees of freedom)? Is a censored drive a drive removed from the analysis or even a drive pulled out of a server and then put back inside?
Thanks for your insight, in this article and your previous one. I've never read anything that takes this approach to HDD failure rates. I agree this representation is much more interesting and probably more correct (e.g. aligning each device's observation history at t=0, instead of looking at time frames, which, after seeing your article, IMHO really does not make any sense).

Ross Lazarus, May 24, 2016 at 3:51 PM
V represents the variance of (obs - expected) so that's the log rank test for curve differences - numerically it varies slightly from the chisquared test and is also a valid test. You will probably want to read up on the method. This https://stat.ethz.ch/education/semesters/ss2011/seminar/contents/presentation_2.pdf has a pretty good explanation but sadly it's still statistics :)

Jack Janssen, September 16, 2016 at 12:19 AM
Thanks for this in-depth analysis, detailed description of your arguments and your update (will there be more?). I totally agree with you that this improves the view on the reliability of the hard disks. Also tribute to Backblaze because of publishing their data and conclusions.

One thing, however, is not clear to me. About the HDDs which have been replaced by Backblaze because the SMART statistics showed values above Backblaze's thresholds (so they would probably fail soon): are they considered as 'failure' or as 'censored' (you said: "I don't trust the smartdrive stats")? I hope they are handled as 'failure' because it's like a patient who is sent home alive with the message that he will die soon. Whether these disks are actually broken or not doesn't matter anymore; for Backblaze their life is over.

Jack

PS Sorry for my English. it's not my native language.

phil, May 23, 2017 at 4:58 PM
Have you 1) published your scripts? 2) looked at 2017 data?

Ross Lazarus, May 23, 2017 at 5:15 PM
Source was published at https://github.com/fubar2/backblazeKM

Pull requests welcomed.

Yes, I have run the most up-to-date data. Very little change, gratifyingly enough.

Ross Lazarus, May 23, 2017 at 6:36 PM
Dunno.
I started parsing those smartdrive stats but there was mucho missing data so I chose to go with the simplest definition of survival - but yes, there must be confounding - OTOH they retire drive pods for other reasons so this all is rather inexact. Never mind the quality, feel the width comes to mind.