Version 34 (Steve Welburn, 2012-11-12 02:33 PM) → Version 35/89 (Steve Welburn, 2012-11-12 02:37 PM)
h1. Evidence Promoting Good Data Management
{{>toc}}
If you have any additional examples that you would like to share, please email them to: rdm.c4dm at gmail.com
h2. Anecdotal Tales Of Lost Data
h3. Recovery of Overwritten Hard Disk Data
5 October 2005 Linux Forums - http://tinyurl.com/8t7uaop
<pre>
Hi, a friend of mine just overwrote two months of her
PhD thesis with an older version. I know recovery of
overwritten data is possible, but wonder if I'd need
special hardware to do it. Does anyone know something
about this ?
Thank You.
</pre>
h3. Stolen laptop had PhD research
19 March 2008 Surrey Leader - http://tinyurl.com/9hmtlv4
<pre>
Thirty-five minutes spent in Langley’s Willowbrook
Shopping Centre cost a Surrey woman much more than
she had anticipated.
Langley RCMP say that while she was shopping from
1-1:35 p.m. last Monday, someone broke into her
vehicle and stole a number of items, including
a Mac iBook laptop containing the research she had
compiled as she worked towards her PhD.
“All that information was on that computer and she
has no back-up file,” said Langley RCMP spokesman
Cpl. Brenda Marshall.
</pre>
h3. Happiness is the return of a stolen computer, with data intact
27 May 2010 The Press, NZ - http://tinyurl.com/38sznnh
<pre>
Never has a man been so happy to see a computer full of data
spreadsheets.
Claudio De Sassi's world fell apart when a car containing almost three
years work towards his PhD was stolen two weeks ago.
De Sassi, a Canterbury University academic, could not hide his joy
yesterday as police reunited him with his stolen laptop and backpack.
</pre>
h3. Thugs steal Christmas, doctoral dreams
22 December 2010 KRQE - http://tinyurl.com/9a5j56f
<pre>
A tiny television sits where a big screen used to, and a Christmas tree
stands with little underneath it...
Even worse than the gifts, the crooks stole a MacBook Pro laptop and a
LaCie hard drive.
The hard drive had … her dissertation and nearly seven years of
research for her doctoral degree she was set to finish in a few weeks.
Osuna had everything backed up on a separate hard drive in a safe, but
burglars made off with that too.
"All I could think about is that all that time is gone, all that effort,
everything is gone," Osuna said.
</pre>
h3. Laptop Stolen From OSU Doctoral Student
6 January 2011 NBC4i - http://tinyurl.com/bmybv9x
<pre>
...her car was broken into and her chrome Mac book pro was stolen.
She has a back-up for all but the last six months of research, but the
most important part of the research had happened recently.
</pre>
h2. The Lost Laptop Problem
* 2010 Ponemon Institute report for Intel re. US laptops
** On average, 2.3% of laptops assigned to employees are lost each year
** In education & research that rises to 3.7%, with 10.8% of laptops being lost before the end of their useful life
*** Useful life is ~3 years, i.e. about the length of one PhD from allocation!
** 75% lost outside the workplace
* Very similar results from 2011 European report!
Intel 2010 - http://tinyurl.com/8c9m4bn
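As a rough consistency check on the two education & research figures above, a 3.7% annual loss rate compounded over a useful life of about 3 years comes out close to the reported 10.8%. The sketch below assumes a constant annual rate and independent years, neither of which is stated in the report; the function name is illustrative only.

```python
def lifetime_loss_probability(annual_rate: float, years: int) -> float:
    """Chance that a laptop is lost at least once within `years` years.

    Assumes a constant annual loss rate and independence between years
    (a simplifying assumption, not something taken from the report).
    """
    return 1.0 - (1.0 - annual_rate) ** years


if __name__ == "__main__":
    # 3.7% per year in education & research, ~3 years of useful life
    probability = lifetime_loss_probability(0.037, 3)
    print(f"Lost within 3 years: {probability:.1%}")
```

Under these assumptions the result is about 10.7%, in line with the 10.8% quoted for education & research.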
h2. Laptop Reliability
* 2011 PC World Laptop Reliability Survey from 63,000 readers:
** 22.6% had significant problems during the product's lifetime
** Of which...
*** 19% had OS problems (~1 in 25 of all laptops)
*** 18% had HDD problems (~1 in 25 of all laptops)
*** 10% had PSU problems (~1 in 50 of all laptops)
PC World 2011 - http://tinyurl.com/876qza5
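The "~1 in N of all laptops" figures in the list above come from multiplying the share of laptops with significant problems (22.6%) by the share of those problems of each type. A minimal sketch of that arithmetic follows; the constant and function names are illustrative, and the percentages are the ones quoted above.

```python
SIGNIFICANT_PROBLEM_RATE = 0.226  # 22.6% of laptops had significant problems

# Share of the problem laptops affected, by problem type (from the list above)
PROBLEM_SHARES = {"OS": 0.19, "HDD": 0.18, "PSU": 0.10}


def share_of_all_laptops(problem_share: float) -> float:
    """Fraction of all surveyed laptops with a given type of problem."""
    return SIGNIFICANT_PROBLEM_RATE * problem_share


if __name__ == "__main__":
    for name, share in PROBLEM_SHARES.items():
        overall = share_of_all_laptops(share)
        print(f"{name}: {overall:.1%} of all laptops, about 1 in {round(1 / overall)}")
```

This gives roughly 1 in 23 for OS problems, 1 in 25 for HDD problems and 1 in 44 for PSU problems, so the figures in the list are rounded approximations.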
h2. Hard Disk Failures
* Failure Trends In A Large Disk Drive Population
** Usenix conference on File and Storage Technologies 2007 (FAST '07)
** Eduardo Pinheiro & Wolf-Dietrich Weber, Google Inc.
* Data collected from over 100,000 disk drives at Google
* As part of repairs procedures:
** ~13% of disk drives replaced over 3 years
** ~20% of disk drives replaced over 4 years
Article: http://tinyurl.com/octz6b
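To compare the cumulative replacement figures above with the yearly failure rates quoted elsewhere, they can be converted to an equivalent constant annual rate. This is only a sketch: it assumes the same replacement rate every year, which is a simplification (the paper reports rates that vary with drive age), and the function name is illustrative.

```python
def equivalent_annual_rate(cumulative_fraction: float, years: float) -> float:
    """Constant yearly rate that would produce `cumulative_fraction`
    replacements after `years` years (assumes the rate never changes)."""
    return 1.0 - (1.0 - cumulative_fraction) ** (1.0 / years)


if __name__ == "__main__":
    for fraction, years in [(0.13, 3), (0.20, 4)]:
        rate = equivalent_annual_rate(fraction, years)
        print(f"{fraction:.0%} over {years} years is about {rate:.1%} per year")
```

Under that assumption, ~13% over 3 years corresponds to about 4.5% per year and ~20% over 4 years to about 5.4% per year.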
h2. Data management in the cloud
See JISC/DCC document "Curation In The Cloud" - http://tinyurl.com/8nogtmv
Service agreements may give wide-ranging rights to the data service.
h3. Google Terms Of Service
1 March 2012 Google Terms of Service : http://tinyurl.com/89dc9fa
<pre>
When you upload or otherwise submit content to our Services, you give
Google (and those we work with) a worldwide license to use, host, store,
reproduce, modify, create derivative works (such as those resulting from
translations, adaptations or other changes we make so that your
content works better with our Services), communicate, publish, publicly
perform, publicly display and distribute such content. The rights you
grant in this license are for the limited purpose of operating, promoting,
and improving our Services, and to develop new ones. This license
continues even if you stop using our Services (for example, for a
business listing you have added to Google Maps).
</pre>
h3. Microsoft Services Agreement
19 October 2012 Microsoft services agreement : http://tinyurl.com/8e4kucy
<pre>
When you upload your content to the services, you agree that it may
be used, modified, adapted, saved, reproduced, distributed, and
displayed to the extent necessary to protect you and to provide, protect
and improve Microsoft products and services. For example, we may
occasionally use automated means to isolate information from email,
chats, or photos in order to help detect and protect against spam and
malware, or to improve the services with new features that makes them
easier to use. When processing your content, Microsoft takes steps to
help preserve your privacy.
</pre>
h2. Archiving Data
h3. BBC Domesday Project
1986 project to create a modern-day Domesday Book (early crowd-sourcing)
* Used “BBC Master” computers with data on laserdisc
* Collected 147,819 pages of text and 23,225 photos
* Media expiring and obsolete technology put the data at risk!
Domesday Reloaded (2011)
* Required emulation of software
* Images restored from original masters
* http://www.bbc.co.uk/history/domesday
To allow long-term access to data
* Don't use obscure formats!
* Don't use obscure media!
* Don't rely on technology being available!
* Do keep original source material!
Google images for "BBC Domesday":https://www.google.co.uk/search?tbm=isch&q=bbc+domesday
h2. Sharing Data
"Sharing Detailed Research Data Is Associated with Increased Citation Rate":http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0000308
h2. Related Media
h3. Disk Drives Break
"DataCent collection of disk drive failure sounds":http://datacent.com/hard_drive_sounds.php
h3. Buildings burn down
"Southampton University Mountbatten Building Fire":http://www.flickr.com/search/?q=Southampton%20University%20Mountbatten%20Building%20Fire
h3. Laptops Break / Get Broken
* "Shot laptop":http://lilysussman.wordpress.com/tag/laptop-destroyed/
* "Google images of broken laptops":https://www.google.co.uk/search?q=broken%20laptop&um=1&tbm=isch
h1. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
h3. Failure Trends In A Large Disk Drive Population
Identified ~13% of hard drives being replaced within 3 years, and ~20% within 4 years, as part of a repairs procedure!
FAST '07 paper on "Failure Trends In A Large Disk Drive Population":https://www.usenix.org/conference/fast-07/failure-trends-large-disk-drive-population
Google report on over 100,000 consumer-grade disk drives of 80-400 GB, produced in or after 2001 and used within Google. Data collected December 2005 - August 2006. Disk drives went through a burn-in process and only those commissioned for use were included in the study, so certain basic defects may well be excluded from the report. Also, the disks were largely used in servers, giving relatively high hours of use compared with desktop / laptop computers.
bq. the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade.
Annualised failure rate (AFR) by drive age: ~3% in first 3 months, ~2% up to 1 year, ~8% @ 2 years, ~9% @ 3 years, ~6% @ 4 years, ~7% @ 5 years
NB: Variation with model and manufacturer!
In the first 6 months, the risk of failure is highest for low & high utilisation!
* ~10% for high utilisation in the first 3 months
* for 3-year old drives ~4-5% chance of failure whatever the utilisation
* failures are most likely at low drive temperatures (on start-up ?) i.e. < 25 deg. C
* drives over 2 years old are most likely to fail at high temperatures (could be mode of failure ?)
Disks with SMART scan errors are 10 times more likely to fail - almost 30% of drives with a SMART scan error failed within 8 months of the error.
* If a drive up to 8 months old gets a scan error, there's a 90% chance of it surviving at least 8 months
* If a drive over 2 years old gets a scan error, there's a 60% chance of it surviving at least 8 months
* If you have more than 1 scan error on a drive, it's significantly less likely to survive
* Similar for SMART reallocation counts: AFR is almost 20% if a reallocation occurs in the first 3 months
* ...but over 36% of failed drives had zero counts on all variables
bq. Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors.
Hard drive manufacturers often quote yearly failure rates below 2% [2]
User studies have seen rates as high as 6% [9]
Between 15% and 60% of drives returned to manufacturers by users as failed have no defect as far as the manufacturers are concerned [7]
Between 20% and 30% “no problem found” cases were observed after analyzing failed drives from a study of 3477 disks [11]
Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18].
h2. More To Read
Schroeder, Bianca, and Garth A. Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?":http://www.usenix.org/event/fast07/tech/schroeder/schroeder.pdf
Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST). 2007.
Lancaster, Larry, and Alan Rowe. "Measuring Real World Data Availability.":http://static.usenix.org/publications/library/proceedings/lisa2001/tech/full_papers/lancaster/lancaster_html/
Proceedings of the LISA 2001 15th Systems Administration Conference. 2001.
McCullough, Bruce D., Kerry Anne McGeary, and Teresa D. Harrison. "Lessons from the JMCB Archive.":http://muse.jhu.edu/journals/mcb/summary/v038/38.4mccullough.html
Journal of Money, Credit, and Banking 38.4 (2006): 1093-1107.
Gleditsch, N.P., C. Metelits and H. Strand. 2003. Posting your data: Will you be scooped or will you be famous?
Int. Stud. Perspect. 4:89–97.
Freckleton, R.P., P. Hulme, P. Giller and G. Kerby. 2005. "The changing face of applied ecology.":http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2005.00969.x/full
J. Appl. Ecol. 42:1–3.
Albers, S. "Editorial: Well Documented Articles Achieve More Impact":http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1568022
BuR Business Research Journal, Vol. 2, No.2, May 2009
Anderson, Richard G., et al. "The role of data/code archives in the future of economic research.":http://www.tandfonline.com/doi/abs/10.1080/13501780801915574
Journal of Economic Methodology 15.1 (2008): 99-119.
Evanschitzky, Heiner, et al. "Replication research's disturbing trend.":http://www.sciencedirect.com/science/article/pii/S0148296306002347
Journal of Business Research 60.4 (2007): 411-415.