

WP1.2 Online Training Material

We consider three stages of a research project, and the appropriate research data management considerations for each of those stages. The stages are:
  • before the research - planning research data management
  • during the research
  • at the end of the research

In addition, we consider the responsibilities of a Principal Investigator regarding data management.

Before The Research - Planning Research Data Management

A data management plan is an opportunity to think about the resources that will be required during the lifetime of the research project and to make sure that any necessary resources will be available for the project. In addition, it is likely that some form of data management plan will be required as part of a grant proposal.

The main questions the plan will cover are:
  • What type of storage do you require?
    Do you need a lot of local disk space to store copies of standard datasets? Will you be creating data which should be deposited in a long-term archive, or published online? How will you back up your data?
  • How much storage do you require?
    Does it fit within the standard allocation for backed-up storage?
  • How long will you require the storage for?
    Is data being archived or published? Does your funder require data publication?
  • How will this storage be provided?
Additional questions may include:
  • What is the appropriate license under which to publish data?
  • Are there any ethical concerns relating to data management, e.g. identifiable participants?
  • Does your research data management plan comply with relevant legislation?
    e.g. Data Protection, Intellectual Property and Freedom of Information

A minimal data management plan for a project using standard C4DM/QMUL facilities could say:

During the project, data will be created locally on researchers' machines and will be backed up to the QMUL network. Software will be managed through the code.soundsoftware.ac.uk site, which provides a Mercurial version control system and issue tracking. At the end of the project, software will be published through soundsoftware and data will be published on the C4DM Research Data Repository.

For larger proposals, a more complete plan may be required. The Digital Curation Centre (DCC) provide an online tool, DMP Online, for creating data management plans: it presents a questionnaire based on institutional and funder templates, asking (many) questions related to RCUK principles, and builds a long-form plan from the responses to match research council requirements. Documents describing how to use DMP Online are available.

It is important to review the data management plan during the project as it is likely that actual requirements will differ from initial estimates. Reviewing the data management plan against actual data use will allow you to assess whether additional resources are required before resourcing becomes a critical issue.
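
As a minimal illustration, actual usage can be checked against the plan's storage estimate with a short script. This sketch assumes a hypothetical data directory and a 50 GB planned allocation; neither is prescribed by C4DM/QMUL:

    # Minimal sketch (hypothetical paths and quota): compare actual disk
    # usage against the storage allocation estimated in the plan.
    from pathlib import Path

    QUOTA_GB = 50                      # allocation assumed in the plan
    DATA_DIR = Path.home() / "data"    # where project data is kept

    used_bytes = sum(p.stat().st_size for p in DATA_DIR.rglob("*") if p.is_file())
    used_gb = used_bytes / 1e9
    print(f"Using {used_gb:.1f} GB of {QUOTA_GB} GB planned")
    if used_gb > 0.8 * QUOTA_GB:
        print("Approaching the planned allocation - time to review the plan")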

In order to create an appropriate data management plan, it is necessary to consider data management requirements during and after the project.

During The Research

During the course of a piece of research, data management is largely risk mitigation - it makes your research more robust and allows you to continue if something goes wrong.

The two main areas to consider are:
  • backing up research data - in case you lose, or corrupt, the main copy of your data (see the sketch after this list);
  • documenting data - in case you need to return to it later.
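
As a concrete illustration of the first point, backups can be scripted and scheduled rather than left to habit. The following sketch assumes hypothetical local and network paths (it is not a prescribed QMUL setup) and writes a timestamped archive of a project directory to backed-up storage:

    #!/usr/bin/env python3
    """Minimal backup sketch: archive a project directory to networked,
    backed-up storage. All paths are hypothetical."""

    import tarfile
    from datetime import datetime
    from pathlib import Path

    PROJECT_DIR = Path.home() / "projects" / "onset-detection"  # local working copy
    BACKUP_DIR = Path("/mnt/network-share/backups")             # backed-up network storage

    def backup(project_dir: Path, backup_dir: Path) -> Path:
        """Write a timestamped .tar.gz of project_dir into backup_dir."""
        backup_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        archive = backup_dir / f"{project_dir.name}-{stamp}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(project_dir, arcname=project_dir.name)
        return archive

    if __name__ == "__main__":
        print("Backed up to", backup(PROJECT_DIR, BACKUP_DIR))

Run from a scheduler (e.g. cron), this gives regular snapshots of work in progress; software is better covered by version control (e.g. Mercurial, as in the minimal plan above).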

In addition to the immediate benefits during research, applying good research data management practices makes it easier to manage your research data at the end of your research project.

We have identified three basic types of research projects, two quantitative (one based on new data, one based on a new algorithm) and one qualitative, and consider the data management techniques appropriate to those workflows. More complex research projects might require a combination of these techniques.

Quantitative research - New Data

For this use case, the research workflow involves:
  • creating a new dataset
  • testing outputs of existing algorithms on the dataset
  • publication of results
The new dataset might include:
  • selection or creation of underlying (audio) data (the dataset might contain the audio itself, or might only reference the material, e.g. for copyright reasons)
  • creation of ground-truth annotations for the audio and the type of algorithm (e.g. chord sequences for chord estimation, onset times for onset detection)
Although the research is producing a single new dataset, the full set of research data involved includes:
  • software for the algorithms
  • the new dataset
  • identification of existing datasets against which results will be compared
  • results of applying the algorithms to the dataset
  • documentation of the testing methodology - e.g. method and algorithm parameters (including any default parameter values).

All of these should be documented and backed up.

Note that if existing algorithms have published results using the same existing datasets and methodology, then results should be directly comparable between the published results and the results for the new dataset. In this case, most of the methodology is already documented and only details specific to the new dataset need to be recorded separately.

If the testing is scripted, the code used is sufficient documentation during the research; readable documentation is only required at publication.
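
For instance, a test script can record the methodology as it runs. In this sketch the algorithm names, parameters and dataset details are all illustrative, and the evaluation itself is stubbed out:

    import json
    from datetime import datetime, timezone

    # Hypothetical runs: existing algorithms with their parameter settings,
    # including defaults -- recording these is part of the methodology.
    RUNS = [
        {"algorithm": "chord-estimator-A", "params": {"window": 4096, "hop": 512}},
        {"algorithm": "chord-estimator-B", "params": {"window": 8192, "hop": 1024}},
    ]
    DATASET = {"name": "new-annotated-dataset", "version": "1.0"}

    def evaluate(algorithm: str, params: dict, dataset: dict) -> float:
        """Stand-in for the real test: run the algorithm over the dataset
        and score its output against the ground-truth annotations."""
        return 0.0  # replace with a real score

    results = []
    for run in RUNS:
        results.append({**run,
                        "dataset": DATASET,
                        "score": evaluate(run["algorithm"], run["params"], DATASET),
                        "timestamp": datetime.now(timezone.utc).isoformat()})

    # The results file doubles as a record of the testing methodology.
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)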

Quantitative research - New Algorithm

A common use-case in C4DM research is to run a newly-developed analysis algorithm on a set of audio examples and evaluate the algorithm by comparing its output with that of a human annotator. Results are then compared with published results using the same input data to determine whether the newly proposed approach makes any improvement on the state of the art.

Data involved includes:
  • software for the algorithm
  • an annotated dataset against which the algorithm can be tested
  • results of applying the new algorithm and competing algorithms to the dataset
  • documentation of the testing methodology

Note that if other algorithms have published results using the same dataset and methodology, then results should be directly comparable between the published results and the results for the new algorithm. In this case, most of the methodology is already documented and only details specific to the new algorithm (e.g. parameters) need to be recorded separately.

As above, if the testing is scripted, the code used is sufficient documentation during the research; readable documentation is only required at publication.
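
As an illustration of comparing an algorithm's output with a human annotator's ground truth, the following sketch scores detected onset times against reference annotations. The 50 ms tolerance and the example times are assumptions, not values prescribed by any particular evaluation campaign:

    def onset_f_measure(detected, reference, tolerance=0.05):
        """F-measure for onset detection: greedily match detected onset
        times (seconds) to reference annotations within +/- tolerance."""
        matched = 0
        unused = sorted(reference)
        for t in sorted(detected):
            for i, r in enumerate(unused):
                if abs(t - r) <= tolerance:
                    matched += 1
                    del unused[i]  # each annotation may match only once
                    break
        precision = matched / len(detected) if detected else 0.0
        recall = matched / len(reference) if reference else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Example: three detections against four annotated onsets -> ~0.57
    print(onset_f_measure([0.51, 1.02, 2.50], [0.50, 1.00, 1.80, 2.40]))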

Qualitative research

An example would be using interviews with performers to evaluate a new instrument design.

The workflow is:
  • Gather data for the experiment (e.g. through interviews)
  • Analyse data
  • Publish data
Data involved might include:
  • the interface design
  • captured audio from performances
  • recorded interviews with performers (possibly audio or video)
  • interview transcripts

Survey participants and interviewees retain copyright over their contributions unless it is specifically assigned to you. In order to have the freedom to publish the content, a suitable rights waiver, transfer of copyright, clearance form or licence agreement should be signed (or agreed on the recording). Likewise, the people (or organisation) recording the event will hold copyright in their materials (e.g. video, photos, sound recordings) unless it is assigned, waived or licensed. Most of this can be dealt with fairly informally for most research, but if you want to publish data then a more formal agreement is sensible. Rather than transferring copyright, an agreement to publish the (possibly edited) materials under a particular license might be appropriate.

Creators of materials (e.g. interviewees) always retain moral rights over their words: the right to be identified as the author of their content, and the right to object to derogatory treatment of their material. Note that this means that in order to publish anonymised interviews, you should have an agreement that allows this.

If people are named in interviews (even if they're not the interviewee) then the Data Protection Act might be relevant.

The research might also involve:
  • demographic details of participants
  • identifiable participants (Data Protection)
  • release forms for people taking part

At The End Of The Research

Whether you have finished a research project or simply completed an identifiable unit of research (e.g. published a paper based on your research), you should look at:
  • publishing the results and the data that supports them
  • archiving the research data you need to keep for the long term

Publication of the results of your research will require:
  • Summarising the results
  • Publishing a relevant sub-set of research data / summarised data to support your paper (see the sketch after this list)
  • Publishing the paper
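
When publishing the supporting data, a simple manifest of the deposited files makes the deposit checkable and easier to reference. A sketch, assuming the data for the paper sits in a hypothetical local directory named paper-data:

    import csv
    import hashlib
    from pathlib import Path

    def sha256sum(path: Path, chunk: int = 1 << 20) -> str:
        """SHA-256 checksum of a file, read in 1 MiB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def write_manifest(data_dir: Path, out_csv: Path) -> None:
        """List every file in data_dir with its size and checksum."""
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["file", "bytes", "sha256"])
            for p in sorted(data_dir.rglob("*")):
                if p.is_file():
                    writer.writerow([p.relative_to(data_dir),
                                     p.stat().st_size, sha256sum(p)])

    write_manifest(Path("paper-data"), Path("MANIFEST.csv"))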

Note that the EPSRC data management principles require sources of data to be referenced.

Research Management

The data management concerns of a PI will largely revolve around planning and appraisal of data management for research projects: to make sure that they conform with institutional policy and funder requirements; and to ensure that the data management needs of the research project are met.

A data management plan (e.g. for use in a grant proposal) will show that you have considered:
  • the costs of preserving your data;
  • funder requirements for data preservation and publication;
  • institutional data management policy;
  • and ethical issues surrounding data management (e.g. data relating to human participants).

After the project is completed, an appraisal of how the data was managed should be carried out as part of the project's "lessons learned".

Data management training should provide an overview of all the above, and keep PIs informed of any changes in the above that affect data management requirements.

Why manage research data?

Funder requirements: http://researchonline.lshtm.ac.uk/208596/

Ponemon reports for Intel on the "Lost Laptop problem": ~10% of Education and Research laptops are lost during their lifetime.

PC World study on laptop failure rates: 20-30% of laptops experience a significant failure.

Failure Trends In A Large Disk Drive Population

FAST '07 paper which identified ~20% of hard drives being replaced over 3 years.

Google report on over 100,000 consumer-grade disk drives from 80-400 GB produced in or after 2001 and used within Google. Data collected December 2005 - August 2006. Disk drives had a burn-in process and only those that were commissioned for use were included in the study - certain basic defects may well be excluded from this report.

The paper's working definition of failure: "the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure. Note that this definition implicitly excludes drives that were replaced due to an upgrade."

Annualised failure rates (AFR) by drive age:
  • ~3% in the first 3 months
  • ~2% up to 1 year
  • ~8% at 2 years
  • ~9% at 3 years
  • ~6% at 4 years
  • ~7% at 5 years

NB: Variation with model and manufacturer!
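
Treating the figures above as independent per-period failure probabilities (a rough back-of-envelope assumption, ignoring the model and vintage variation just noted), the cumulative risk over a drive's first five years can be sketched as:

    # Rough arithmetic: probability a drive survives its first 5 years,
    # multiplying per-period survival rates from the AFR figures above.
    rates = [0.03, 0.02, 0.08, 0.09, 0.06, 0.07]  # 0-3 months, 3-12 months, years 2-5
    survival = 1.0
    for r in rates:
        survival *= 1.0 - r
    print(f"~{1 - survival:.0%} cumulative chance of failure within 5 years")  # roughly 30%

In other words, even mid-single-digit annual failure rates compound into a substantial chance of losing an unbacked-up drive over the life of a project.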

In the first 6 months, the risk of failure is highest for low and high utilisation:
  • ~10% for high utilisation in the first 3 months
  • for 3-year-old drives, ~4-5% chance of failure whatever the utilisation
  • failures are most likely at low drive temperatures (on start-up?), i.e. < 25 deg. C
  • drives over 2 years old are most likely to fail at high temperatures (could be the mode of failure?)
Disks with SMART scan errors are 10 times more likely to fail - almost 30% of drives with a SMART scan error failed within 8 months of the error.
  • If a drive up to 8 months old gets a scan error, there's a 90% chance of it surviving at least 8 months
  • If a drive over 2 years old gets a scan error, there's a 60% chance of it surviving at least 8 months
  • If you have more than 1 scan error on a drive, it's significantly less likely to survive
  • Similar for SMART reallocation counts: AFR is almost 20% if reallocation occurs in the first 3 months
  • ...but over 36% of failed drives had zero counts on all variables

From the paper's survey of related work: "Talagala and Patterson [20] perform a detailed error analysis of 368 SCSI disk drives over an eighteen month period, reporting a failure rate of 1.9%. Results on a larger number of desktop-class ATA drives under deployment at the Internet Archive are presented by Schwarz et al [17]. They report on a 2% failure rate for a population of 2489 disks during 2005, while mentioning that replacement rates have been as high as 6% in the past. Gray and van Ingen [9] cite observed failure rates ranging from 3.3-6% in two large web properties with 22,400 and 15,805 disks respectively. A recent study by Schroeder and Gibson [16] helps shed light into the statistical properties of disk drive failures. The study uses failure data from several large scale deployments, including a large number of SATA drives. They report a significant overestimation of mean time to failure by manufacturers and a lack of infant mortality effects. None of these user studies have attempted to correlate failures with SMART parameters or other environmental factors." (Bracketed references are to the paper's bibliography.)

  • Hard drive manufacturers often quote yearly failure rates below 2% [2]; user studies have seen rates as high as 6% [9].
  • Between 15% and 60% of drives returned to manufacturers as failed by users have no defect as far as the manufacturers are concerned [7].
  • Between 20% and 30% "no problem found" cases were observed after analysing failed drives from a study of 3477 disks [11].
  • Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18].

Overarching concerns

  • Human participation - ethics, data protection
  • Audio data - copyright
  • Storage - where? how? SLA?
  • Short-term resilient storage for work-in-progress
  • Long-term archival storage for research data outputs
  • Curation of archived data - refreshing media and formats
  • Drivers - FoI, RCUK