Steve Welburn, 2012-11-12 02:38 PM
1 | 1 | Steve Welburn | h1. Failure Trends In A Large Disk Drive Population |
---|---|---|---|
2 | 1 | Steve Welburn | |
3 | 1 | Steve Welburn | The study found that ~13% of hard drives were replaced within 3 years, and ~20% within 4 years, because a repair was required! |
4 | 1 | Steve Welburn | |
5 | 1 | Steve Welburn | FAST '07 paper on "Failure Trends In A Large Disk Drive Population":https://www.usenix.org/conference/fast-07/failure-trends-large-disk-drive-population |
6 | 1 | Steve Welburn | |
7 | 1 | Steve Welburn | Google report on over 100,000 consumer-grade disk drives of 80-400 GB, produced in or after 2001 and used within Google. Data was collected December 2005 - August 2006. Drives went through a burn-in process and only those commissioned for use were included in the study, so certain basic defects may well be excluded from this report. Also, the disks were largely used in servers, resulting in (relatively) high hours of use compared with desktop / laptop computers. |
8 | 1 | Steve Welburn | |
9 | 1 | Steve Welburn | bq. the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade. |
10 | 1 | Steve Welburn | |
11 | 1 | Steve Welburn | Annualised failure rate (AFR) by drive age: ~3% in first 3 months, ~2% up to 1 year, ~8% @ 2 years, ~9% @ 3 years, ~6% @ 4 years, ~7% @ 5 years |
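
As a rough illustration of what the per-age rates above would mean for a single cohort of drives, the sketch below compounds them into a cumulative failure probability. It rests on several assumptions that are not in the paper: the percentages are treated as annualised rates, each rate is applied as a constant rate over its whole age band, and the band boundaries (0-3 months, 3-12 months, then whole years) are a guess at how the figures map onto time. The names @AFR_BY_BAND@ and @cumulative_failure@ are hypothetical. The output need not match the ~13% and ~20% replacement figures quoted at the top of these notes.

```python
# Age bands and approximate annualised failure rates as quoted in the notes.
# The (duration, rate) band boundaries are an assumption made for this sketch.
AFR_BY_BAND = [
    (0.25, 0.03),  # first 3 months
    (0.75, 0.02),  # 3 months up to 1 year
    (1.0, 0.08),   # year 2
    (1.0, 0.09),   # year 3
    (1.0, 0.06),   # year 4
    (1.0, 0.07),   # year 5
]


def cumulative_failure(bands):
    """Return a list of (age_in_years, cumulative failure probability)."""
    survival = 1.0
    age = 0.0
    out = []
    for duration_years, afr in bands:
        # Probability of surviving part of a year at a constant annual rate.
        survival *= (1.0 - afr) ** duration_years
        age += duration_years
        out.append((age, 1.0 - survival))
    return out


if __name__ == "__main__":
    for age, failed in cumulative_failure(AFR_BY_BAND):
        print(f"{age:4.2f} years: {failed:.1%} cumulative failures")
```

This is naive compounding only; the study's age groups may mix different drive populations, so the numbers should be read as an order-of-magnitude check rather than a prediction.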
12 | 1 | Steve Welburn | |
13 | 1 | Steve Welburn | NB: Variation with model and manufacturer! |
14 | 1 | Steve Welburn | |
15 | 1 | Steve Welburn | In the first 6 months, the risk of failure is highest for low & high utilisation! |
16 | 1 | Steve Welburn | * ~10% for high utilisation in the first 3 months |
17 | 1 | Steve Welburn | * for 3-year old drives ~4-5% chance of failure whatever the utilisation |
18 | 1 | Steve Welburn | * failures are most likely at low drive temperatures (on start-up ?) i.e. < 25 deg. C |
19 | 1 | Steve Welburn | * drives over 2 years old are most likely to fail at high temperatures (could be mode of failure ?) |
20 | 1 | Steve Welburn | |
21 | 1 | Steve Welburn | Disks with SMART scan errors are 10 times more likely to fail - almost 30% of drives with a SMART scan error failed within 8 months of the error. |
22 | 1 | Steve Welburn | * If a drive up to 8 months old gets a scan error, there's a 90% chance of it surviving at least 8 months |
23 | 1 | Steve Welburn | * If a drive over 2 years old gets a scan error, there's a 60% chance of it surviving at least 8 months |
24 | 1 | Steve Welburn | * If you have more than 1 scan error on a drive, it's significantly less likely to survive |
25 | 1 | Steve Welburn | * Similar for SMART reallocation counts: AFR is almost 20% if a reallocation occurs in the first 3 months |
26 | 1 | Steve Welburn | * ...but over 36% of failed drives had zero counts on all variables |
27 | 1 | Steve Welburn | |
28 | 1 | Steve Welburn | |
29 | 1 | Steve Welburn | Hard drive manufacturers often quote yearly failure rates below 2% [2] |
30 | 1 | Steve Welburn | User studies have seen rates as high as 6% [9] |
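
Manufacturers usually state reliability as a mean time to failure (MTTF) rather than a yearly rate, so it can help to convert between the two when comparing the 2% and 6% figures above. The sketch below uses the common constant-failure-rate (exponential) approximation, which is an assumption and not something the paper relies on; the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, assuming continuous operation


def afr_from_mttf(mttf_hours):
    """AFR implied by an MTTF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def mttf_from_afr(afr):
    """MTTF in hours implied by an AFR, under the same assumption."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # The 2% and 6% yearly rates are the figures quoted in the notes.
    for afr in (0.02, 0.06):
        print(f"AFR {afr:.0%} -> MTTF of about {mttf_from_afr(afr):,.0f} hours")
```

Under this approximation a yearly failure rate of 6% corresponds to an MTTF roughly a third of that implied by 2%, which gives a sense of the gap between quoted and observed reliability.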
31 | 1 | Steve Welburn | |
32 | 1 | Steve Welburn | Between 15% and 60% of drives that users considered to have failed and returned to manufacturers have no defect as far as the manufacturers are concerned [7] |
33 | 1 | Steve Welburn | Between 20-30% “no problem found” cases were observed after analyzing failed drives from a study of 3477 disks [11] |
34 | 1 | Steve Welburn | |
35 | 1 | Steve Welburn | Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. |
36 | 1 | Steve Welburn | |
37 | 1 | Steve Welburn | bq. Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors. |
38 | 1 | Steve Welburn | |
39 | 1 | Steve Welburn | Talagala, Nisha, and David Patterson. "An analysis of error behavior in a large storage system.":http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.1997&rep=rep1&type=pdf |
40 | 1 | Steve Welburn | Computer Science Division, University of California, 1999. |