Version 25 (Steve Welburn, 2013-01-08 12:19 PM) → Version 26/28 (Steve Welburn, 2013-01-08 02:52 PM)
h1. Archiving research data
For archival purposes data needs to be stored in a location which provides facilities for long-term preservation of data. As well as standard data management concerns (e.g. backup, documentation) the media and the file formats will need to be appropriate for long-term use.
Whereas work-in-progress data is expected to change regularly during the research process, archived data will change rarely, if at all. Archived data can therefore be stored on write-once media (e.g. CD-R).
In addition, it is not necessary to archive all intermediate results - archived data is reused infrequently, so requiring a few days to regenerate results is reasonable. However, all necessary documentation, software and data should be archived to allow results to be recreated. Datasets which have already been archived will not need archiving "again". However, if the archiving system supports deduplication then storing multiple copies of the same content will require minimal additional storage.
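As an illustration of why deduplication makes repeated copies cheap, the following is a minimal sketch of content-addressed storage, not a description of any particular archiving system. The function names (@sha256_of@, @store@) and the flat archive layout are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(source: Path, archive_root: Path) -> Path:
    """Copy source into archive_root, named by its content hash.

    Hypothetical layout: one file per distinct content. If the same
    content is stored again, nothing new is written.
    """
    archive_root.mkdir(parents=True, exist_ok=True)
    target = archive_root / sha256_of(source)
    if not target.exists():
        shutil.copyfile(source, target)
    return target
```

Storing two files with identical content returns the same path both times, so the second copy takes no additional space. A real system would also need to record the original file names separately.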
Once archived, the archive copy should not be modified directly and data access should only be required to create a new work-in-progress copy of the data to work from. Access to archived data will therefore be sporadic. Hence, it is possible to store archived data "off-line" only to be accessed when required.
It is important that archiving data is performed in an appropriate manner to allow future use of the data. This will require the use of appropriate formats for the data and storage on suitable media.
If the original content is not in an open format, then providing copies in multiple formats may be appropriate - e.g. an original Microsoft Word document, a PDF version to show how the document should look and the plain-text content so the document can be recreated.
Within C4DM, there are currently few [[Data_management_resources|resources]] available to support this. The best available option is the research group network folder as this is backed up to tape.
{{include(Archiving_Properly)}}
h2. Media
Archive copies of data may be held on the same types of media as used during research. Additionally, write-once media (e.g. CD-R, DVD+/-R, BD-R) may be appropriate.
Removable drives (e.g. USB flash drives, FireWire HDD) may be used, but there is a risk of hardware failure with these devices - they are not "just" data storage.
Removable media (e.g. CD-R, tapes) do not have the risk of hardware failure but the media themselves may be damaged or become unusable - the estimated lifetime of an optical disc is 2-100 years. Whether a specific disc will last 2 years or 100 is not something that can easily be judged - although buying high-quality media rather than cheap packs of 100 discs may help.
As with all technology, there is a risk of obsolescence:
* devices to read removable media may no longer be commonplace (e.g. floppy disc drives, ZIP drives)
* formats used for removable media may no longer be supported (e.g. various formats for DVD-RAM discs)
* interfaces used for removable drives may no longer be commonplace (e.g. parallel or SCSI ports, PATA/IDE disc drives)
All media decay / become obsolete over time. It is therefore necessary to refresh the media by copying the data to new media at intervals. Doing this regularly reduces the risk of discovering that your archived data is inaccessible.
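When refreshing media it is worth checking that the new copy matches the old one. The following is a minimal sketch of one way to do this using checksums. The function names and the choice of SHA-256 are assumptions for illustration, not an established C4DM procedure.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List files from the old manifest that are missing or differ in the new one."""
    return sorted(name for name, digest in old.items() if new.get(name) != digest)
```

A manifest built from the old media can be compared with one built from the new media. An empty list from @changed_files@ suggests the copy is complete. Keeping the manifest alongside the archive also allows later checks for decay.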
If data is stored on a RAID (Redundant Array of Independent Disks), then it is possible to replace an individual disk in the array and rebuild its content, thus refreshing the media.
Archived data is still at risk of data loss, and should be backed up somewhere else!
Archiving data is best supported through provision of a data archiving service (e.g. through a library). The burden of maintaining archival standards of storage for the media is then taken on by the service provider. This may appear to the user as a network drive, or as an archive system to which data packages may be submitted. Such a system may be part of a data management system which also supports publication of data.
h2. File Formats
File formats also become obsolete. Although the original data should be archived, it is also recommended that copies of the data are stored in more accessible formats, e.g. PDF outputs from LaTeX source, TIFF versions of images, FLAC copies of audio files. The more specialised the source format, the stronger the requirement for readable formats! Closed formats (e.g. Microsoft Word documents) are particularly vulnerable to obsolescence - e.g. if you change the application you use from MS Word to OpenOffice, even if the document can be opened you may find that the formatting is not reproduced correctly without purchasing MS Office.
Questions to ask about each format include:
* LaTeX source - will all the required packages be available if you want to rebuild the document?
* Images - will the format be available? Is it a closed format (e.g. GIF)?
If data is stored in lossy formats (e.g. MP3), future decoders for that format may not produce precisely the same output (audio) as the decoder used in the initial experiments. An archived copy of the data should therefore always include a lossless version (e.g. PCM or FLAC for audio). Preferably, research should take place on the lossless data decoded from the lossy files.
Current audio formats may become obsolete in the future. We therefore recommend that, when archiving audio files, copies of the data are stored in an open lossless format as well as in the original format. We would currently recommend using "FLAC":http://flac.sourceforge.net/ to compress audio files - FLAC files use less space than the raw data and allow metadata tags to be included (e.g. artist and track name). If the use of compressed files is not appropriate, we would recommend use of uncompressed PCM audio in WAV format.
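As a sketch of how a FLAC copy might be produced, the following builds a command line for the @flac@ reference encoder. It assumes @flac@ is installed and on the PATH. The helper names (@flac_command@, @encode_to_flac@) and the example tag are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def flac_command(wav_path: Path, tags: Optional[Dict[str, str]] = None) -> List[str]:
    """Build a flac command that writes a .flac file next to the WAV file."""
    flac_path = wav_path.with_suffix(".flac")
    # --best: highest compression; --verify: decode while encoding to check the result
    command = ["flac", "--best", "--verify", "-o", str(flac_path)]
    for field, value in (tags or {}).items():
        command.append(f"--tag={field}={value}")  # e.g. ARTIST or TITLE
    command.append(str(wav_path))
    return command


def encode_to_flac(wav_path: Path, tags: Optional[Dict[str, str]] = None) -> None:
    """Run the encoder; raises CalledProcessError if flac reports a failure."""
    subprocess.run(flac_command(wav_path, tags), check=True)
```

For example, @encode_to_flac(Path("take1.wav"), {"ARTIST": "Example"})@ would write @take1.flac@ alongside the WAV file. The original file should still be kept in the archive, as recommended above.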
h2. Summary
Archiving data requires:
* refreshing the media at suitable intervals by moving data onto new media
* creating copies of the data in new formats to allow their use (e.g. converting data in closed formats to open formats, updating data to new versions of file formats).