Backing up » History » Version 30
Steve Welburn, 2012-11-16 04:35 PM
1 | 1 | Steve Welburn | h1. Backing up |
---|---|---|---|
2 | 1 | Steve Welburn | |
3 | 1 | Steve Welburn | h2. Why back up your data ? |
4 | 1 | Steve Welburn | |
5 | 15 | Steve Welburn | * [[Reliability|Hard disks die]] |
6 | 13 | Steve Welburn | * Portable devices can be lost or broken |
7 | 15 | Steve Welburn | * [[Disasters]] happen |
8 | 15 | Steve Welburn | * [[Tales Of Lost Data|Laptops get stolen]] |
9 | 13 | Steve Welburn | |
10 | 1 | Steve Welburn | h2. How to back up data |
11 | 1 | Steve Welburn | |
12 | 25 | Steve Welburn | The core principle is that backup copies of data should regularly be stored in a different location to the main copy. |
13 | 1 | Steve Welburn | |
14 | 1 | Steve Welburn | Suitable locations for backups are: |
15 | 21 | Steve Welburn | * A firesafe, preferably in a different building |
16 | 1 | Steve Welburn | * A network copy |
17 | 1 | Steve Welburn | ** A network drive e.g. provided by the institution |
18 | 1 | Steve Welburn | ** Internet storage (in the cloud) |
19 | 1 | Steve Welburn | ** A data repository - this could be a public thematic / institutional repository for publishing completed research datasets, or an internal repository for archiving datasets during research |
20 | 24 | Steve Welburn | * A portable device / portable media which you keep somewhere other than under your desk / with your laptop. |
21 | 1 | Steve Welburn | |
22 | 21 | Steve Welburn | Backing up to external devices means that you need access to the device, whereas network drives and "internal" backups are usually more readily available. For example, you can back up every time you're in the office / lab or at home. |
23 | 21 | Steve Welburn | |
24 | 26 | Steve Welburn | The best backup is the one you do. How often you need to back up depends very much on how much new data you've generated and how difficult it would be to recreate. Primary data (e.g. digital audio recordings of interviews) should be backed up as soon as possible, as it may be very time-consuming to recreate. If an algorithm runs for days generating data files, you may want to set it up to create backup copies as it proceeds rather than backing up only at the end of the processing. If you've changed some source code and can regenerate the data in an afternoon, you may not need to back up the data - but the source code should be safely stored in a version control system "somewhere":http://code.soundsoftware.ac.uk. If you feel too busy to back up your data, it may be a hint that you should make sure there's a copy somewhere safe! |
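As a rough illustration of backing up as processing proceeds, the sketch below writes each output file and immediately copies it to a second location. It is only a minimal example: the function name @write_with_backup@ and the directory layout are hypothetical, and the backup directory is assumed to be somewhere other than the disk holding the primary copy (e.g. a mounted network drive).

```python
import shutil
from pathlib import Path


def write_with_backup(data: bytes, primary: Path, backup_dir: Path) -> Path:
    """Write one output file, then immediately copy it to the backup location.

    Hypothetical helper for illustration; backup_dir is assumed to live on
    a different device or network drive from the primary copy.
    """
    primary.parent.mkdir(parents=True, exist_ok=True)
    primary.write_bytes(data)

    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / primary.name
    # copy2 copies the contents and also preserves the modification time
    shutil.copy2(primary, backup)
    return backup
```

A long-running job could call this once per generated file, so that a failure part-way through loses at most the file currently being written.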
25 | 21 | Steve Welburn | |
26 | 22 | Steve Welburn | Remember that if you delete your local copy of the data then the backup becomes the primary copy... is that copy backed up anywhere? If a network drive is used, it *may* be backed up to tape - but you should check this with your IT provider. |
27 | 4 | Steve Welburn | |
28 | 4 | Steve Welburn | h2. Can't I just put it in the cloud ? |
29 | 4 | Steve Welburn | |
30 | 17 | Steve Welburn | You can, but the [[Cloud Service Agreements|service agreement]] with the provider may give them a lot of rights... review the service agreement and decide whether you are happy with it! |
31 | 5 | Steve Welburn | |
32 | 17 | Steve Welburn | Looking at service agreements in November 2012, we found that Google's "terms":http://www.google.com/policies/terms/ let them use your data in any way which will improve their services - including publishing your data and creating derivative works. This is partly a side-effect of Google switching to a single set of terms for all their services. For Microsoft SkyDrive, the Windows Live "services agreement":http://windows.microsoft.com/en-US/windows-live/microsoft-services-agreement is pretty similar. |
33 | 11 | Steve Welburn | |
34 | 16 | Steve Welburn | Apple's iCloud is "better":http://www.apple.com/legal/icloud/en/terms.html as they restrict publication rights to data which you want to make public / share. "Dropbox":https://www.dropbox.com/privacy is relatively good - probably because they just provide storage and aren't mining it to use in all their other services! |
35 | 9 | Steve Welburn | |
36 | 19 | Steve Welburn | Even so, there are issues. Data stored in the cloud is still stored somewhere... you just don't have control over where that location is. Your data may be stored in a country which gives the government the right to access data. Also, the firm that stores your data may still be required to comply with the laws of its home country when the data is stored elsewhere. It is, however, unlikely that digital audio research data will be sensitive enough for this to be an issue. |
37 | 11 | Steve Welburn | |
38 | 18 | Steve Welburn | A Forbes article on "Can European Firms Legally Use US Clouds To Store Data":http://www.forbes.com/sites/ciocentral/2012/01/02/can-european-firms-legally-use-u-s-clouds-to-store-data/ stated that: |
39 | 11 | Steve Welburn | |
40 | 11 | Steve Welburn | bq. Both Amazon Web Services and Microsoft have recently acknowledged that they would comply with U.S. government requests to release data stored in their European clouds, even though those clouds are located outside of direct U.S. jurisdiction and would conflict with European laws. |
41 | 11 | Steve Welburn | |
42 | 18 | Steve Welburn | If you are worried about what rights a service provider may have to your data in their cloud, then consider encrypting it - e.g. using an encrypted .dmg file on a Mac, or using "TrueCrypt":http://www.truecrypt.org/ for a cross-platform solution. These create an encrypted "disc" in a file which you can mount and treat like a real disc - but all the content is encrypted. Note that changing data on an encrypted disc may change the entire contents of the disc file, so the whole disc may need to be resynced to the cloud storage. Alternatively, "BoxCryptor":https://www.boxcryptor.com/ or "EncFS":http://www.arg0.net/encfs (also available "for Windows":http://tinyurl.com/683ye4q) will encrypt individual files separately, allowing synchronisation to operate more effectively. |
43 | 11 | Steve Welburn | |
44 | 12 | Steve Welburn | "SpiderOak":http://spideroak.com provide "zero knowledge" privacy in which all data is encrypted locally before being submitted to the cloud, and SpiderOak do not have a copy of your decryption key - i.e. they can't actually examine your data. |
45 | 20 | Steve Welburn | |
46 | 20 | Steve Welburn | See the JISC/DCC document "Curation In The Cloud" - http://tinyurl.com/8nogtmv |
47 | 27 | Steve Welburn | |
48 | 27 | Steve Welburn | h2. Surely there must be a quicker way... |
49 | 27 | Steve Welburn | |
50 | 27 | Steve Welburn | Figuring out which files to copy can be very tedious, and usually leads to just backing up large chunks of data together. However, utilities can be used to copy just those files that have been updated - or even just update the parts of files that have changed. |
51 | 27 | Steve Welburn | |
52 | 27 | Steve Welburn | The main command-line utility for this on UNIX-like systems (Mac OS X, Linux) is "rsync":http://rsync.samba.org/. From the "rsync man page":http://rsync.samba.org/ftp/rsync/rsync.html: |
53 | 27 | Steve Welburn | |
54 | 27 | Steve Welburn | bq. Rsync is a fast and extraordinarily versatile file copying tool. It can copy locally, to/from another host over any remote shell, or to/from a remote rsync daemon. It offers a large number of options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. Rsync is widely used for backups and mirroring and as an improved copy command for everyday use. |
55 | 27 | Steve Welburn | |
56 | 27 | Steve Welburn | bq. Rsync finds files that need to be transferred using a "quick check" algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file's data does not need to be updated. |
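The "quick check" idea described above can be sketched in a few lines of Python. This is not rsync itself and does not implement the delta-transfer algorithm: it only illustrates skipping files whose size and last-modified time are unchanged, and copies changed files whole. The function names @needs_copy@ and @sync_tree@ are hypothetical.

```python
import shutil
from pathlib import Path


def needs_copy(src: Path, dst: Path) -> bool:
    """Quick check: copy if dst is missing or differs in size or modified time."""
    if not dst.exists():
        return True
    s, d = src.stat(), dst.stat()
    # Compare whole seconds, since some filesystems store coarser timestamps
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_tree(src_root: Path, dst_root: Path) -> list[Path]:
    """Copy only new or changed files from src_root to dst_root.

    Returns the list of destination files that were written.
    """
    copied = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        if needs_copy(src, dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            # copy2 preserves the modification time, so the next run skips this file
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied
```

Running @sync_tree@ twice in a row copies nothing the second time, which is the behaviour that makes regular backups quick. With rsync itself, a broadly comparable local copy would be @rsync -a source/ backup/@.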
57 | 28 | Steve Welburn | |
58 | 30 | Steve Welburn | For Windows, there is an "rsync tool for Windows":http://www.rsync.net/resources/howto/windows_rsync.html, and "DeltaCopy":http://www.aboutmyip.com/AboutMyXApp/DeltaCopy.jsp provides a GUI over rsync. |
59 | 28 | Steve Welburn | |
60 | 29 | Steve Welburn | In addition, there are modern continuous backup programs (e.g. Apple's "Time Machine") which will synchronise data to a backup device and allow you to revert to any point in time. However, these solutions may not be appropriate if your data is large. |
61 | 28 | Steve Welburn | |
62 | 28 | Steve Welburn | Version control systems for source code are optimised for storing plain text content and are *not* an appropriate way to store data *unless* the data is text (e.g. CSV files). |