Steve Welburn, 2012-11-12 02:38 PM

h1. Failure Trends In A Large Disk Drive Population

The study found that ~13% of hard drives were replaced within 3 years, and 20% within 4 years, because a repair was required!

FAST '07 paper on "Failure Trends In A Large Disk Drive Population":https://www.usenix.org/conference/fast-07/failure-trends-large-disk-drive-population

A Google report on over 100,000 consumer-grade disk drives, from 80 to 400 GB, produced in or after 2001 and used within Google. Data was collected from December 2005 to August 2006. The drives went through a burn-in process and only those commissioned for use were included in the study, so certain basic defects may well be excluded from this report. Also, the disks were largely used in servers, resulting in (relatively) many powered-on hours compared with desktop / laptop computers.

bq. the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade.

Annualised failure rate (AFR) by drive age: ~3% in the first 3 months, ~2% up to 1 year, ~8% at 2 years, ~9% at 3 years, ~6% at 4 years, ~7% at 5 years

NB: Failure rates vary with model and manufacturer!

In the first 6 months, the risk of failure is highest for drives with low or high utilisation!
* ~10% for high utilisation in the first 3 months
* for 3-year-old drives, a ~4-5% chance of failure whatever the utilisation
* failures are most likely at low drive temperatures, i.e. < 25 deg. C (on start-up?)
* drives over 2 years old are most likely to fail at high temperatures (could this be the mode of failure?)

Disks with SMART scan errors are 10 times more likely to fail: almost 30% of drives with a SMART scan error failed within 8 months of the error.
* If a drive up to 8 months old gets a scan error, there's a 90% chance of it surviving at least 8 more months
* If a drive over 2 years old gets a scan error, there's a 60% chance of it surviving at least 8 more months
* If a drive has more than 1 scan error, it's significantly less likely to survive
* Similarly for SMART reallocation counts: the AFR is almost 20% if a reallocation occurs in the first 3 months
* ...but over 36% of failed drives had zero counts on all SMART variables

Hard drive manufacturers often quote yearly failure rates below 2% [2]
User studies have seen rates as high as 6% [9]

Between 15% and 60% of drives that users returned to manufacturers as failed have no defect as far as the manufacturers are concerned [7]
In a study of 3477 disks, 20-30% of the failed drives analysed turned out to be “no problem found” cases [11]

Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18].

bq. Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.

Talagala, Nisha, and David Patterson. "An analysis of error behavior in a large storage system.":http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.1997&rep=rep1&type=pdf
Computer Science Division, University of California, 1999.
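
The notes above compare yearly failure rates (below 2% quoted by manufacturers, up to 6% seen by users) and mention manufacturers overestimating mean time to failure (MTTF). As a rough sketch of how the two figures relate, the Python below converts between them. It assumes a constant failure rate (an exponential lifetime model) and a drive powered on all year. Neither assumption comes from the paper, and the age-dependent rates noted above suggest the constant-rate model is only a first approximation. The function names are made up for this example.

```python
import math

# Assumption: the drive is powered on 24 hours a day, 365 days a year.
HOURS_PER_YEAR = 8760


def afr_from_mttf(mttf_hours: float) -> float:
    """Probability of failure within one year for a given MTTF in hours.

    Assumes a constant failure rate (exponential lifetime model).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def mttf_from_afr(afr: float) -> float:
    """MTTF in hours implied by an annualised failure rate (0 < afr < 1).

    Inverse of afr_from_mttf, under the same constant-rate assumption.
    """
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # The two yearly failure rates quoted in the notes above.
    for afr in (0.02, 0.06):
        print(f"AFR {afr:.0%} -> MTTF of about {mttf_from_afr(afr):,.0f} hours")
```

Under these assumptions a 2% yearly failure rate corresponds to an MTTF of roughly 430,000 hours, and a 6% rate to roughly 140,000 hours.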