Printable Version » History » Version 8

Steve Welburn, 2012-11-13 12:39 PM


SoDaMaT Wiki

This contains the full content of the SoDaMaT Wiki.

Sound Data Management Training (SoDaMaT)

(for general information re. Research Data Management please see the parent project Wiki)

Overview

Sound Data Management Training (SoDaMaT) is an eight-month project to create and evaluate discipline-specific data management training material for digital music and audio research. The materials will be targeted at: postgraduate research students (MSc and PhD); research staff (postdoctoral researchers, CIs, PIs); and academic staff. The project is to run at the Centre for Digital Music (C4DM) at Queen Mary University of London (QMUL) from June 2012 to January 2013, in collaboration with the QMUL Learning Institute.

The immediate objectives of the SoDaMaT project are:
  1. to develop specific training material on data management planning for research projects, targeting research and academic staff in digital music and audio research;
  2. to develop training material covering the different aspects of research data management, including subject-specific topics such as music copyrights, for postgraduate students, research and academic staff in the area of digital music and audio;
  3. to collaborate with institutional partners at QMUL (The Learning Institute), other projects (SoundSoftware.ac.uk), and discipline-specific societies (Digital Music Research Network, International Society for Music Information Retrieval) to test the training material in postgraduate courses, workshops, and tutorials, and to collect feedback on their quality and impact;
  4. to collaborate with institutional partners at QMUL (Learning Institute; School of Electronic Engineering and Computer Science; IT Services) to embed the training material into postgraduate curricula and Continuous Professional Development courses to assure the long-term sustainability and generalisation of the project's results to other similar disciplines.

The requirements will be scoped, and the training materials will be trialled, within the Centre for Digital Music (C4DM), part of the School of Electronic Engineering and Computer Science at QMUL.

In addition to designing, producing, and evaluating discipline-specific training material, a wider objective is to promote good practice in research data management through education and awareness both within QMUL, and across UK and overseas research institutions in the digital music and audio area.

Background

A survey on data management practices among researchers and students at C4DM, conducted during the JISC-funded Sustainable Management of Digital Music Research Data (SMDMRD) project, showed very low awareness of the importance of research data management as part of the research workflow. Although many researchers organise their data in folders and perform semi-regular backups, the majority answered "no" to the specific question "Do you have a particular strategy for data management?". Through our links with other groups via the EPSRC-funded Sound Software project (see Collaborations section), we have good reason to believe the situation is similar in many other music and audio research groups. The results of the survey point to the need for raising awareness of the benefits of research data management, such as a potential increase in citations, understanding and meeting data management requirements set by funding bodies (e.g. EPSRC), and producing sustainable and reproducible research.

The SMDMRD project defined a set of data management policies and created a pilot data management system for C4DM. This was a pioneering effort within QMUL, and a collaboration with the QMUL IT Services has been recently established to adapt the results of the SMDMRD project to define institutional policies, and build an institutional research data repository. Policies can be used to raise awareness among research staff and students by imposing rules of conduct, and adherence to such policies is supported by tools like a data repository. Nevertheless, policies only give a general idea of why research data management is important, and the enthusiasm for using a data management system can easily fade if a culture for data management is not established. These facts point to the need for continuous, embedded, sustainable data management training, with strong focus on promoting the benefits of research data management, ideally from the early stages of a researcher's career.

The Digital Music and Audio Researcher Profile

A wealth of material for training researchers in data management has been produced by previous JISC-funded projects such as Incremental and those in the RDMTrain programme. The Research Data Management Skills Support Initiative (DaMSSI), which collected and compared the results from discipline-specific data management training projects in the RDMTrain programme, concluded in its final report that "participants respond well to discipline-specific examples and the opportunity to discuss issues with tutors and others in similar disciplines" and that "a discipline-specific approach is more likely to engage students - in many cases principles are the same across disciplines but are more interesting to students if these principles can be seen in the students' own context". DaMSSI also produced three discipline-specific researcher profiles - in the social sciences, in clinical psychology, and in archaeology - and two generic data profiles - the conservator and the data manager. We believe that researchers at the Centre for Digital Music, and researchers in similar laboratories or institutions, do not fit the above-mentioned profiles.

The Centre for Digital Music (C4DM) at QMUL is one of the leading research centres in the field of audio and music technology and signal processing. C4DM makes use of a variety of data as research inputs - most obviously audio datasets - and produces a variety of types of data as research outputs. These outputs include:
  1. manually annotated feature data ("reference annotations") such as expert chord and key transcriptions of existing music recordings which are used as comparative data for evaluating research work;
  2. automatically produced annotations such as those accompanying the publication of methods for audio feature analysis.

The primary targets for the training material to be produced by the proposed project are postgraduate research students, and research and academic staff in C4DM, who perform research over a range of areas including music informatics, machine listening, audio engineering and interaction. A common use-case in C4DM research is to run a newly-developed analysis algorithm on a set of audio examples and evaluate the algorithm by comparing its output with that of a human annotator. Results are then compared with published results using the same input data to determine whether the newly proposed approach makes any improvement on the state of the art.
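
The evaluation step in this use-case can be sketched in a few lines of Python. This is only a minimal illustration, not C4DM's actual evaluation code: the function name frame_accuracy, the chord labels, and the choice of a simple frame-wise accuracy measure are all assumptions made for the example, and real evaluations typically use more elaborate, task-specific metrics.

```python
from typing import Sequence


def frame_accuracy(reference: Sequence[str], estimated: Sequence[str]) -> float:
    """Return the fraction of frames whose estimated label matches the reference annotation."""
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same number of frames")
    if len(reference) == 0:
        raise ValueError("no frames to compare")
    matches = sum(ref == est for ref, est in zip(reference, estimated))
    return matches / len(reference)


# Hypothetical chord labels, one per analysis frame.
# 'reference' stands for the human annotation, 'estimated' for the algorithm output.
reference = ["C", "C", "G", "G", "Am", "Am", "F", "F"]
estimated = ["C", "C", "G", "Em", "Am", "Am", "F", "C"]

score = frame_accuracy(reference, estimated)
print(f"Frame-wise accuracy: {score:.2f}")
```

Publishing both the reference annotations and the algorithm output alongside a paper allows other researchers to recompute such a score on the same input data, which is what makes the comparison with published results possible.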

The type of data used in digital music and audio research poses some challenges that need to be addressed in discipline-specific training material. These challenges include:
  1. Copyright: the copyright status of digital music data is often difficult to establish. For example, the owner of internally generated data might be unclear, or data purchased or downloaded from outside might have special license requirements that must be adhered to. To avoid unnecessary risk, researchers therefore often refrain from publishing data. Addressing this aspect in detail and emphasising the use of less restrictive licenses (e.g. Creative Commons, Open Data Commons) could lead to a larger amount of data being published in public repositories.
  2. Metadata: the line between data and metadata is often unclear. For example, descriptive metadata (e.g. a song's title, author, year of publication, or key) may in another context be used as data. The training material will focus on defining what data and metadata are, on the importance of metadata standards, and on their use, together with standard protocols such as OAI-PMH and SWORD, to exchange data among repositories.
  3. Ethical approval and participant agreement: experimental work based on human responses (e.g. perceptual listening tests) requires ethical approval. A lack of information and experience on this topic leads people to write ethics forms that prohibit the release of data, preventing other researchers from reproducing or extending their results, even when the data could be safely released, with the participants' consent, if anonymised. The material will include information on how ethical approval works, how to obtain it, and information about publication of sensitive data.
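
As an illustration of the last point, the sketch below shows one simple way in which listening-test responses might be prepared for release. It is an assumption-laden example rather than a recommended procedure: the field names (participant, email, stimulus, rating), the helper name anonymise, and the pseudonym format are all hypothetical. Replacing names with codes is strictly pseudonymisation; genuine anonymisation may also require removing indirect identifiers, and must in any case follow the approved ethics protocol.

```python
def anonymise(rows, id_field="participant", drop_fields=("email",)):
    """Replace participant identifiers with sequential codes and drop direct identifiers.

    Returns new dictionaries; the input rows are left unchanged.
    """
    pseudonyms = {}  # maps original identifier -> code such as "P001"
    cleaned_rows = []
    for row in rows:
        original = row[id_field]
        if original not in pseudonyms:
            pseudonyms[original] = f"P{len(pseudonyms) + 1:03d}"
        # Copy every field except those listed as direct identifiers
        cleaned = {key: value for key, value in row.items() if key not in drop_fields}
        cleaned[id_field] = pseudonyms[original]
        cleaned_rows.append(cleaned)
    return cleaned_rows


# Hypothetical responses from a perceptual listening test
responses = [
    {"participant": "Alice Smith", "email": "alice@example.org", "stimulus": "clip01", "rating": 4},
    {"participant": "Bob Jones", "email": "bob@example.org", "stimulus": "clip01", "rating": 2},
    {"participant": "Alice Smith", "email": "alice@example.org", "stimulus": "clip02", "rating": 5},
]

for row in anonymise(responses):
    print(row)
```

The mapping from names to codes is deliberately not returned, so it cannot be released by accident together with the cleaned data.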

In addition to the recommendations from DaMSSI, the need for specific training material for digital music and audio researchers is justified by at least two additional factors. First, most of the researchers are either computer scientists or electrical engineers and have advanced IT skills. Second, the data is very heterogeneous, rapidly changing, and relatively small in size. As a result, it is usually managed by the creator of the data itself. Thus, the clear separation pointed out by the profiles produced by DaMSSI, as well as in Pryor and Donnelly (2009, p. 165), between the data creator and the data manager/librarian/scientist becomes blurred: all the different aspects can be, and often are, taken care of by the same person.

Evaluation

Strong attention will be paid to evaluating the quality of the training material and its impact on research practice. By taking advantage of the established collaborations, the material will be tested in different situations, including postgraduate courses, internal and external seminars and workshops, and tutorials at international conferences. The International Society for Music Information Retrieval (ISMIR) serves the purposes of fostering the exchange of ideas between and among members whose activities, though diverse, stem from a common interest in music information retrieval. A tutorial proposal has been submitted in collaboration with the Sound Software project to the 2012 ISMIR conference (8-12 October in Porto, Portugal). A tutorial proposal will also be submitted to DAFx-12 (Digital Audio Effects conference, 17-21 September in York).

The QMUL Learning Institute will provide support and know-how in evaluation methodologies and analysis.

Feedback will be collected using:
  1. anonymous questionnaires after the tutorials/workshops, tailored to the specific audience;
  2. online questionnaires;
  3. standard course evaluation for postgraduate modules;
  4. focus groups interviewed a few months after the training to establish the longer-term impact of the training.

The feedback will be used to iteratively improve the material. Revised versions of all training materials will be available by the end of the project.

Sustainability

We aim to achieve sustainability in the longer term both in the digital music and audio research community, and within QMUL. Our goals are:
  1. to make discipline-specific training sustainable in the digital music and audio research community. Awareness will be raised by presenting the material in collaboration with the Sound Software project at similar UK research institutions, and at discipline-specific conferences (ISMIR and DAFx). Training material will be made available for reuse through the Jorum repository.
  2. to set an example within QMUL. The project will be used as an example by the QMUL Learning Institute, the School of Electronic Engineering and Computer Science, and the IT Services to expand the data management training to other disciplines by adapting the material and methodologies, starting from related research areas such as Signal Processing, and more generally Electronic Engineering and Computer Science. Data management training will be integrated in postgraduate curricula: every PhD student is expected to take part in approximately 210 hours of development activities (including research methods courses) over the course of their studies and the points gained are mapped against the four domains of the Vitae/RCUK Researcher Development Framework. Material for Continuous Professional Development courses for research and academic staff will also be adapted to other disciplines, and all face-to-face training will be complemented by online training material.

Workplan

The work of the project is divided into four work packages (WP):

An overview of the intended content of the work packages is here

Training the Trainers

Additional Notes

References

Pryor, G. and Donnelly, M. (2009). Skilling up to do data: whose role, whose responsibility, whose career? The International Journal of Digital Curation. Vol. 4(2), pp. 158--170.

Research Data Management Skills Support Initiative (DaMSSI) final report

SoDaMat Printable Version

Workplan

The work of the project is divided into four work packages (WP).

WP1 Training Material Design

Although the basic principles of data management are valid for both postgraduate students, and research and academic staff, we decided to make a distinction between the two groups (WP1.3 and WP1.4) - a PhD student starting on their project and a PI writing a grant proposal might want to focus on different aspects of data management. The online material (WP1.2) will cover all aspects and be relevant to both groups.

WP1.1 Research Of Available Resources

Results from previous projects (e.g. JISC RDMTrain programme, Research Data Management Skills Support Initiative (DaMSSI), Incremental), as well as available material from the DCC and other institutions, will be studied and evaluated. Disciplines will be compared and parts of the available material identified that need to be adapted to appeal to researchers in the area of digital music and audio research. In order to integrate the material into the Vitae/RCUK Researcher Development Framework, used to assign credits by the QMUL Learning Institute, the recently released "Information-handling Lens" will also be analysed.

WP1.2 Online Training Material

The Incremental project recommends in its final report (page 21) to "create a collection of webpages to help researchers find tools and assistance". Examples will include FAQs, fact-sheets, online step-by-step guides (e.g. on creating a data management plan for PIs writing a project proposal), short instructional videos (e.g. on how to deposit a data set into a repository, from metadata collection to choosing a license). It will target both new members of staff who could not participate in face-to-face training, and those who need quick reference material or want to learn in greater depth after a seminar. It will also contain information on where to get help for different problems (e.g. copyrights, technical) inside the institution. The online material will be prepared first because it should be already in place when face-to-face training is given.

The online materials have been prepared in the form of a wiki, and are part of this site.

WP1.3 Research Staff Material

Material will be designed that targets research and academic staff involved in funded research projects, although the basic principles will be relevant to students as well. Experience from the Sound Software project showed that different material is useful at different stages of a project. We will thus create a range of training materials to cover some of these stages, to be presented in different formats (e.g. short seminars, tutorials, workshops), and to be complemented by online material. Examples include, but are not limited to:

  1. a five-minute long "executive" pitch on the benefits of data management;
  2. hands-on workshops for CIs and PIs on data management planning for research projects;
  3. conference tutorials giving an overview of research data management;
  4. material for short seminars with in-depth analysis of single aspects of data management such as available tools, policies, and discipline-specific challenges.

This material will be presented at internal seminars, discipline-specific conferences and, in collaboration with the Sound Software project, at other institutions in the UK working in the area of digital music and audio research.

WP1.4 Post-Graduate Course Material

Discipline-focused material for face-to-face training sessions will be designed. The material will cover the basics of good data management practice, point out its benefits, and touch on discipline-specific challenges such as copyrights and licenses, with discipline-specific examples, based on the recently developed C4DM Data Management System. Also, as suggested by the DaMSSI project final report (Conclusions, page 15, paragraph 5), the students will be instructed to create a Data Management Plan for their PhD projects, to be included in their Research Proposal. The material should be sufficient to cover at most one or two sessions in a module. For more in-depth study, the students will then be referred to the online material. The material will be tested first with postgraduate students at C4DM, and then at other research groups in the Digital Music Research Network.

WP1 Deliverables

  • D1.1 Summary and analysis of material already available.
  • D1.2 First draft of the online material.
  • D1.3 First draft of the research staff material.
  • D1.4 First draft of the postgraduate course material.
  • D1.5 Updated version of the research staff material.
  • D1.6 Updated version of the online reference material.
  • D1.7 Updated version of the postgraduate course material.

WP2 Test and evaluation

WP2.1 Evaluation strategies design

With the support of the QMUL Learning Institute, workshop questionnaires will be developed based on their prior experience from other projects in order to evaluate the effectiveness of the training material and its delivery. The feedback obtained will be used to inform future activities.

WP2.2 Feedback collection and analysis - online material

The online material will be released as early as possible during the project. Continuous online evaluation will be used to collect feedback and make the appropriate changes.

WP2.3 Feedback collection and analysis - research staff material

The workshop material will be tested at various institutions across the UK, and at the ISMIR 2012 conference, where questionnaires will be handed out at the end of each session.

WP2.4 Feedback collection and analysis - postgraduate course material

Feedback for the postgraduate course material will be collected through the standard course evaluation procedures in place at QMUL.

WP2 Deliverables

  • D2.1 Questionnaires for evaluating training material (all types).
  • D2.2 Summary of the collected feedback for the online material and recommendations for improvement.
  • D2.3 Summary of the collected feedback for the research staff material and recommendations for improvement.
  • D2.4 Summary of the collected feedback for the postgraduate course material and recommendations for improvement.

WP3 Embedding

This work package organised the various workshops, courses and seminars in collaboration with the partners.

WP3 Deliverables

  • D3.1 Final report on embedding.

WP4 Communication and Management

WP4.1 Project management

The project will be managed on a day-to-day basis by the PI, with project meetings held weekly to assess progress and problems. This has been our practice throughout the Sound Software project and previous JISC-funded projects. The CIs will participate in the management process to ensure compatibility and continuity with the requirements of the Sound Software project from a management and technical perspective respectively.

WP4.2 Dissemination

The project results will be disseminated through blog posts, Twitter, and official reports on the project's website. Results will also be presented at discipline-specific conferences (ISMIR, DAFx), and to other similar UK-based research institutions via the partnership with the Sound Software project.

WP4 Deliverables

  • D4.1 Project site and feed.
  • D4.2 Final report and publication of the material in the Jorum repository.

References

DaMSSI final report

WP1.1 Research Of Available Resources

Previous JISC Projects with Data Management Training Outputs

There are lots of materials relating to data management available through Jorum; these include audio interviews, PowerPoint presentations, factsheets, videos and more. Many of these are outputs of previous JISC-funded projects, and we consider some of those here.

The JISC RDMTrain programme funded five discipline-specific research data management training projects in 2010-2011.

Two projects produced online courses:
  • Project CAIRO - Managing Creative Arts Research Data (4 short units)
  • MANTRA - for geosciences, social and political sciences and clinical psychology (a very detailed self-guided course)
The remaining three courses are published as downloadable materials for training sessions:
  • DATUM for Health - 3 sessions
  • DMTpsych - 6 sessions
  • DataTrain - different versions for archaeology (4 sessions) and social anthropology (3 modules targeted at different audiences)
The Supporting Data Management Infrastructure for the Humanities (Sudamih) project at Oxford was funded by JISC under the Research Data Management Infrastructure Programme. Sudamih produced training materials specifically to fit in with the practice of humanities research at Oxford and also released de-localised materials on Jorum:
  • Three slideshows at varying levels of detail including materials targeted at post-doc researchers (Jorum)
  • Research Data Management Factsheet (Jorum)
  • Research Information Management Guides (Jorum)
  • Research Information Management: Organising Humanities Material (Jorum)
  • Research Information Management: Tools for the Humanities (Jorum)

The Incremental project at Glasgow and Cambridge was also part of the JISC Research Data Management Infrastructure programme. Incremental aimed to develop a data management infrastructure by examining existing practices and requirements at the institutions, piloting tools and services to enable data management (examples of proposed outputs included "templates, training, best practice guidelines, and policy") and embedding those outputs within the institutions. In addition, they aimed to disseminate the results to the wider research community. During the course of the project many training resources were produced and these have been published on Jorum. We provide a summary of some resources on our page on Incremental.

Vitae Researcher Development Framework

The Vitae Researcher Development Framework (RDF) categorises the knowledge, behaviours and attributes of researchers and uses this as a foundation to guide the development of researcher skills.

In April 2012, Vitae published an information literacy component for the RDF.

Information literacy is an umbrella term which encompasses concepts such as digital, visual and media literacies, academic literacy, information handling, information skills, data curation and data management. Interacting with information is at the very heart of research and informed researchers are both consumers and producers of information.

The RDF component included an information literacy lens - mapping information literacy skills onto the RDF researcher model - and an Informed Researcher Booklet giving guidelines to researchers on evaluating and improving their information literacy.

The RDF Information Literacy lens is largely based on the Society of College, National and University Libraries (SCONUL) 7 Pillars Of Information Literacy

Other Vitae RDF Lenses and the more general SCONUL 7 Pillars Of Wisdom may also be of interest.

JISC and the Research Information Network (RIN) co-funded the Research Data Management Skills Support Initiative (DaMSSI) at the DCC. This aimed to examine how the Vitae RDF and the SCONUL 7 Pillars Of Information Literacy could be used to improve the planning of data management training, contributing to the development of the Vitae Information Literacy Lens and Informed Researcher Booklet, above.

The DaMSSI outputs were (from here):

Of particular interest to the current project are the mappings of previous RDM training projects onto the RDF and the Digital Curation Lifecycle Model.

DCC and Other Institutions

Several UK universities have published materials relating to data management and, in particular, data management training: Universities in other countries have also published materials:

Although the legal and funder requirements for these organisations will differ from the UK situation, the underlying principles for data management are still the same.

Some UK research councils have published policies regarding data management, data sharing and data curation:

We are in the process of summarising the main Research Council Requirements.

Other organisations have also produced materials related to data management training:

Legislation

Resources For Learning Materials

QMUL resources for e-Learning

Resources available at QMUL for eLearning include:
  • Moodle - a Course Management System (CMS), also known as a Learning Management System (LMS) or a Virtual Learning Environment (VLE).
  • Mahara - an open source ePortfolio system
  • Articulate for developing online/e-Learning materials
  • qReview lecture capture system
  • Adobe Connect web-conferencing
  • Bristol Online Surveys (QMUL) for developing... surveys

Links

Doctoral Training Centres as catalysts for research data management
RDM training for Postgraduates and Doctoral Training Centres
Open Exeter PGR Workshop on Data Management

DataTrain

DataTrain for archaeology and for social anthropology

Modules are available on Jorum:

Archaeology

Licensed CC-BY-NC-SA

Structure of course:

Modules:

  1. Creating and managing research data in archaeology: an overview
  2. Data lifecycles and management plans
  3. Working with digital data
  4. Rights and digital data
  5. E-Theses and supplementary digital data
  6. Archiving digital data
  7. Post-Graduate data management plans
  8. Project and professional data: data management on post-doctoral research projects and beyond

The teaching modules were run as a trial course in March 2011, as part of a post-graduate course in Digital Skills for Dissertation and Publications, Department of Archaeology, University of Cambridge. The data management course comprised 4 x 2 hour sessions:

  1. Creating and Managing Data - Defining post-graduate research data
  2. Working with Digital Data
    File structure, naming, and formats
    E-theses and supplementary digital data
    Post-Graduate Data Management Plans
  3. Project and Professional Data
    Data management for larger research projects
  4. Archiving and Re-using Data
    Depositing digital data
    Intellectual Property Rights and research data

The slides and notes have been kept as simple and as straightforward as possible. They are not meant to be exhaustive in the information they contain. Rather, they provide an overview of the general issues regarding data management.

Each module has been designed to take approximately 30 minutes to complete. Six of the eight presentations have between 10 and 16 slides (including front title and end acknowledgement slides). The two longer modules are Module 3: Working with Digital Data; and Module 8: Project and Professional Data.

Module 3 (Working with Digital Data) has 38 slides, many of which contain a lot of information on different file types and formats. This information has been summarised from the Archaeology Data Service’s Guides to Good Practice, and content most relevant to post-graduate students is presented in a straightforward way. Rather than spending an hour presenting Module 3 in detail (and boring the students to death), it is suggested that the slides be presented as a ‘lightning tour’ of the practical issues of working with digital data. The slides can then be made available for future reference by the students as a handout.

Module 8 (Project and Professional Data) provides an introduction to data management at a higher level of research, including writing AHRC Technical Appendices. While this can be run as a stand-alone session, given that this is the desired career path of many doctoral students, and the fact that many doctoral students carry out their research as part of larger projects, the aim of the module is to round off the post-graduate course by looking forward beyond the submission of a PhD Thesis.

Comments regarding discipline-specific nature (from notes for part 1 of course):

Can archaeology be considered in any way a special case in terms of how we create, manage, and archive digital data?
The simple answer is no. The issues of how best to manage digital data and safeguard its preservation in the long term are broadly the same across all disciplines.
The same goes for individual archaeological projects. Even though some might think that their own project is a special case in terms of complicated digital data, or for the fact that they will produce very little in the way of digital data, at the heart of it, the same issues apply, just on a larger or smaller scale.
A key issue which does vary from discipline to discipline is that of what are private data and what are public data. This does arise in archaeology particularly in regard to sensitive data of site or artefact locations, or sensitive personal data collected during the course of a research project.
What perhaps sets archaeology apart from other disciplines is the appreciation of the historical significance of what we do. And the fact that very often, the practice of archaeology is a destructive process and the physical and digital data obtained represent a unique archive – an experiment that cannot be repeated.

However... primary data is often paper-based. Notes, sketches etc.

One area of discipline specificity is the selection of bodies that provide definitions of good practice and/or archiving facilities (e.g. Archaeology Data Service). Who are these for digital audio research ? AES ? JASA ? ISMIR ? IEEE ? Others ?

Includes details of copyright terms for 8 types of creative works: Literary; Artistic; Sound; Typographic; Broadcasts; Dramatic; Film; and Musical.

For post-grad students, e-Theses are covered. Publishing a digital copy of a thesis makes it "published" and means that all copyright details need to be ironed out.

Part 8 is largely related to resources (arch. specific).

Social Anthropology

A different approach...
  • Basic module - aimed at pre-fieldwork PhD students, fundamentals
  • Advanced module - metadata, ethics, IPR, FoI, data protection, tools
  • Writing-up module - for PhD students and early stage researchers, includes info on long-term archiving

Can be combined to produce a 1-day course.

Mentions reference management. Where is the line between Research Data Management and Data Management ?

Lots of info. on data capture - digitizing data.

Points to an interesting list of formats from the UK Data Archive (http://www.data-archive.ac.uk)

Posting things on CDs/DVDs might be a good idea for infrequent sharing of large amounts of data. Beware of security issues, which can be sidestepped by encryption (more later); and of decay/damage.

In the Advanced module, examples are drawn from the discipline.

DATUM for Health

Comprises 3 sessions: Plus additional notes for:

Downloadable from Jorum (CC-BY-NC-SA)

Session 1: Introduction To Data Management (Northumbria)

  • What is research data ?
  • Where is your research data ?
  • Why manage research data
    • a requirement
    • to work effectively & efficiently
    • to protect it
    • for use and/or re-use
    • to share it
    • for preservation
    • because it is good research practice
  • How to manage research data
  • The research data lifecycle
    • Plan / Create / Analyse / Preserve / Share / Use (and repeat...)
  • Creating a DMP

Session 2: Data Curation Lifecycle (Northumbria)

  • What is data curation ?
  • Why curate ?
    • Requirements
    • Rewards
  • DCC Data Curation Lifecycle Model
    • Conceptualise - planning
    • Create - collection & analysis
    • Appraise - selection
    • Ingest - transferring to a custodian
    • Preserve - keeping data over time
    • Store - keeping data safe
    • Access - finding data
    • Transform - generating new data

Session 3: Problems and practical strategies and solutions (Northumbria)

  • What problems are there ?
    • Conflicting considerations
    • Resource issues
    • anything else ?
  • Conflicts
    • Confidentiality and sharing
      • personal and sensitive data - anonymisation, consent
  • Data security and storage
    • File and folder names
    • Locations
    • Email is not secure
    • Physical security - destroy USB sticks, shred documents
  • Metadata

DMTpsych

Postgraduate training for research data management in the psychological sciences

DMTpsych built upon existing research data management materials developed by the Digital Curation Centre (DCC) to create discipline-focused postgraduate training materials that can be embedded into postgraduate research training in the psychological sciences. The materials produced consist of:

  • PowerPoint slides to be used in taught research methods courses
  • Workbook containing psychology specific guidance on completing the DCC’s Online Data Management Planning Tool (including worked examples)
  • A paper copy of the DMPT to be completed by students (actually at DCC)

The lectures are structured thematically to match the existing DCC DMPT with the eight key sections forming the centrepiece of six psychology specific lectures and round table discussions.

Deliverables online

Material available for:
  • Overview
  • 1. Historical and Conceptual Issues and Best Practice
  • 2. Introduction and context to psychology-specific DMPT
  • 3a. Access, data sharing and re-use; Legal and ethical issues
    Good detail on Data Protection and FoI. Less good on IPR.
  • 3b. Data standards and capture methods
  • 4. Short-term storage and data management; Deposit and long-term preservation
  • 5. Resourcing; Adherence, review and long-term management
  • 6. Completion of your own Data Management Plan
  • Informed Consent Form

Licensed CC-BY-NC:

This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.

Written from a psychology perspective... but the content isn't particularly psych.

Incremental

This project will build on earlier work by HATII and the DCC to support research data management. It will analyse needs at Glasgow and Cambridge across a number of different disciplines; propose a range of tools or services to address those needs; and develop, adapt and pilot these within each institution. Outputs will then be further adapted and prepared for embedding in local infrastructures and wider dissemination via the Digital Curation Centre, Digital Preservation Coalition, and JISC. The project intends to focus on the provision of softer infrastructure (e.g. templates, training, best practice guidelines, and policy).

Includes multimedia files (audio, video)

Funded by JISC 2010-2011

From Jorum (largely CC-BY-NC-SA)

  • Re-use, sharing, and archiving sensitive research data: a practical overview - slideshow (Jorum)
  • How data centres and repositories can help with research data management (Jorum)
  • University of Glasgow: bidding for grant funding workflow (Jorum)
  • University of Cambridge: bidding for funding workflow (Jorum)
  • The university ethics process and how it impacts on making creative work (Jorum)
  • The benefits of sharing research data (Jorum)
Digital media:
  • Managing music data (Jorum)
  • Managing multimedia research data (Jorum)
  • Working with digital media files (Jorum)
Sensitive data:
  • Archiving sensitive research data (Jorum)
  • Managing sensitive data in performing arts - narrated slideshow (Jorum)
  • Re-use, sharing, and archiving sensitive research data: a practical overview - slideshow (Jorum)
IPR:
  • Intellectual Property Rights and Research Data: Focus on copyright - narrated slideshow (Jorum)
  • Who owns IPR? - flowchart (Jorum)
  • Intellectual property rights (IPR) and the creation and use of research materials (Jorum)
  • Intellectual property rights and University of Cambridge: Focus on patents and commercialisation - narrated slideshow (Jorum)
FoI:
  • How the Freedom of Information Act (FOI) applies to research data (Jorum)
  • FAQ for Freedom of Information and Environmental Information Requests for Research Data - narrated slideshow (Jorum)
  • Using the UK Freedom of Information Act: A practical guide for researchers - narrated slideshow (Jorum)
  • What options do researchers have when asked to release their data by an FOI request? (Jorum)
Factsheets:
  • UK research funders' data policies (Jorum)
  • Organising files and folders (Jorum)
  • Adding metadata to Microsoft Office documents (Jorum)
  • Choosing the right digital storage media for you (Jorum)
  • Selecting which data to keep (Jorum)
  • Common Image Formats (Jorum)
  • Selecting which data to keep at University of Glasgow (Jorum)
  • Version control across devices (Jorum)

Incremental Project

Content produced by the Incremental project is released under Creative Commons licence BY-NC-SA

Project site at Cambridge

  • Create
  • Organise
  • Access
  • Look After

MANTRA

geosciences, social and political sciences and clinical psychology

This course is an Open Educational Resource that may be freely used by anyone.
It is available through an open license for re-using, rebranding, repurposing.

License

You are free to re-use part or all of this work elsewhere, with or without modification. In order to comply with the attribution requirements of the Creative Commons license (CC-BY), we request that you cite:

  • the author/creator: EDINA and Data Library, University of Edinburgh
  • the title of the work: Research Data MANTRA [online course]
  • the URL where the original work can be found: http://datalib.edina.ac.uk/mantra

Downloadable from Jorum (DSpace!)

Structure:
  • Introduction
  • Research Data Explained: in depth discussion of what data is and types of data
  • Data Management Plans
  • Organising Data: File naming and storing
  • File Formats
  • Documentation and Metadata
  • Storage and Security
  • Data Protection, Rights and Access (in development / not available)
  • Preservation, Sharing and Licensing (in development / not available)
  • Recommended Resources

Includes videos. Approx. 20 slides per topic (13-30).

Appears to have been built using Xerte

Project CAIRO - Managing Creative Arts Research Data

Online course materials consisting of four units: Downloadable from Jorum:

Introduction to Research Data Management

Examines:

Presents a workflow for arts data management:
Planning -> Creating -> Shaping -> Long-term management

Planning should be good practice - science is largely about data collection and evaluation. Planning this process is part of experimental design. Can you easily meet more than just immediate data needs and contribute to the community at large ?
  • Who's it for ?
  • What documentation is required ?
  • Are there stipulations on data management (timescales, repositories, publish, sensitivity) ?
  • Is assessment required ? How do we enable it ?
  • Are there guidelines we should follow (e.g. institutional) ?
  • If we will publish data, do repositories have requirements for formats ?
Creating is day-to-day working data management:
  • collecting permissions as required
  • documenting data
  • considering file formats
  • backups
Shaping is curation:
  • selection of data
  • extending metadata
  • use of sustainable file formats

Long-term management is after-the-research management of data - NB: the nature of this means that it will involve handing data over to a long-term archive. Occasional activities required (e.g. changing file formats)

AHRC rules are that data needs to be kept for 3 years after a project concludes. (see Research Council Requirements)

Creating Research Data

Focuses on actions before data is created:

HE and FE institutions should ensure that [...] employees and students are aware that, while some exemptions are granted for the use of personal data for research purposes, the majority of the Data Protection Principles must still be conformed to — there is no blanket exemption.
(JISC Data Protection Code of Practice for the HE and FE Sectors (2001))

The simplest way to deal with the Data Protection Act (DPA) is to remove personal information. Anonymised data doesn't come under the DPA. So consider carefully whether any personal details in the data add to its usefulness or could be removed.
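
As a minimal sketch of removing personal details before data is stored or shared, the Python below drops named fields from a set of participant records. The field names and the example records are hypothetical, not taken from the CAIRO materials. Removing fields is only a first step: the details that remain may still identify someone in a small group.

```python
# Hypothetical field names; adjust to the columns in your own data.
PERSONAL_FIELDS = {"name", "email", "date_of_birth"}


def strip_personal_fields(records, personal_fields=PERSONAL_FIELDS):
    """Return copies of the records with the personal fields removed."""
    return [
        {key: value for key, value in record.items() if key not in personal_fields}
        for record in records
    ]


# Made-up example records for illustration only.
participants = [
    {"name": "A. Performer", "email": "a@example.org",
     "instrument": "violin", "years_playing": 12},
    {"name": "B. Player", "email": "b@example.org",
     "instrument": "guitar", "years_playing": 5},
]

anonymised = strip_personal_fields(participants)
print(anonymised)
```

The original records are left untouched, so the identifiable version can be kept separately under tighter access control (or destroyed) as the project's ethics approval requires.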

Managing Research Data

Delivering Research Data

Identifying issues that come to light after the creation of research data, and overcoming those issues.

WP2.1 Evaluation Strategy Design

In order to evaluate the (training) materials, it is necessary to:
  • identify specific (learning) objectives which they aim to meet
  • evaluate whether the materials meet those objectives
Additionally, we need to:
  • identify the overall purposes of the materials
  • evaluate whether the cumulative objectives satisfy these overall purposes

In order to produce the best possible materials, it is necessary to evaluate and then revise the materials. Initial evaluation of the materials will take place once a first draft has been created, but before they are used in training. This will concentrate on the suitability and level of the content. After the initial evaluation and update, the materials will be used in training courses and begin an ongoing series of formative and summative evaluations (i.e. evaluations during and after the training). These evaluations will apply Kirkpatrick's four-level evaluation model [1].

Methods of Evaluation

Design review

Pre-course (evaluation of materials)

On-going evaluation of course

  • Informal / Formal Review e.g.:
    • Questionnaire to see how easy it is to find relevant material / test users' knowledge
    • Focus Groups (I-Tech guide)
  • In-course / formative evaluation (see I-Tech)
    • Assessment of level of knowledge within training group
    • Checking progress with participants
    • Trainer assessment - self assessment and from other trainers if possible
    • Pre- and Post-course questionnaires - assess change in answers (true/false + multiple-choice)
  • Post-course / summative evaluation
    • Debriefing of trainer (did it work ? to time ? did it engage people ?)
    • Questionnaire for participants (see Sample training evaluation forms)
    • Medium-term review of usefulness of course content / adoption of techniques (e.g. 2-3 months after course)
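
The pre- and post-course questionnaire comparison listed above could be scored along these lines. This is a sketch only: the three-question answer key and the responses are hypothetical, and a real evaluation would aggregate over all participants.

```python
def score(answers, answer_key):
    """Count how many questions were answered correctly."""
    return sum(1 for question, correct in answer_key.items()
               if answers.get(question) == correct)


# Hypothetical answer key and one participant's responses
# (true/false and multiple-choice questions).
answer_key = {"q1": True, "q2": "b", "q3": False}
pre_course = {"q1": False, "q2": "b", "q3": True}
post_course = {"q1": True, "q2": "b", "q3": False}

pre_score = score(pre_course, answer_key)
post_score = score(post_course, answer_key)
change = post_score - pre_score
print(f"pre: {pre_score}, post: {post_score}, change: {change:+d}")
```
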
Kirkpatrick's Four Level Evaluation Model
  • Reaction - to the course ("motivation" may be more appropriate)
    • Pacing, was it enjoyable
  • Learning - from the course
    • Did the facts get across ?
  • Behavior - changes after the course
    • Did participants actually manage their data better
      • during research ?
      • at the end of research ?
    • Have data management plans been produced for grant proposals ?
  • Results or Impact - to a wider community
    • Did they publish data ?
    • Was any data loss avoided ?
Review content for:
  • reading level
  • correctness
  • organization
  • ease of use

based on the target audience

Tools and links

Bristol Online Surveys

I-Tech Training Toolkit

Instructional System Design approach to training

Free Management Library - Evaluating Training and Results

Lingualinks Implement A Literacy Program - Evaluating Training

Training Works!... ...what you need to know about managing, designing, delivering, and evaluating group-based training

References

[1] Kirkpatrick, D. L. (1959). Techniques for evaluating training programs. Journal of the American Society of Training Directors, 13, 3–9.

[2] Kirkpatrick, D. L. (1976). Evaluation of training. In R. L. Craig (Ed.), Training and development handbook: A guide to human resource development (2nd ed., pp. 301–319). New York: McGraw-Hill.

WP1.2 Online Training Material

We consider three stages of a research project, and the appropriate research data management considerations for each of those stages. The stages are:

In addition, we consider the responsibilities of a Principal Investigator regarding data management.

Background material is also available on "why manage research data ?", and there is an alternate view of the content based on individual research data management skills.

Before The Research - Planning Research Data Management

A data management plan is an opportunity to think about the resources that will be required during the lifetime of the research project and to make sure that any necessary resources will be available for the project. In addition, it is likely that some form of data management plan will be required as part of a grant proposal.

The main questions the plan will cover are:
  • What type of storage do you require ?
    Do you need a lot of local disk space to store copies of standard datasets ? Will you be creating data which should be deposited in a long-term archive, or published online ? How will you back up your data ?
  • How much storage do you require ?
    Does it fit within the standard allocation for backed-up storage ?
  • How long will you require the storage for ?
    Is data being archived or published ? Does your funder require data publication ?
  • How will this storage be provided ?
Appropriate answers will relate to: Additional questions may include:
  • What is the appropriate license under which to publish data ?
  • Are there any ethical concerns relating to data management e.g. identifiable participants ?
  • Does your research data management plan comply with relevant legislation ?
    e.g. Data Protection, Intellectual Property and Freedom of Information
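
For the "how much storage" question above, a rough estimate for audio data can be worked out from the recording format. The sketch below computes the size of uncompressed PCM audio (sample rate × bytes per sample × channels × duration). The project figures (200 recordings of 4 minutes, CD-quality stereo, two copies) are hypothetical.

```python
def pcm_size_bytes(duration_seconds, sample_rate_hz=44100, bit_depth=16, channels=2):
    """Size of uncompressed PCM audio, ignoring file headers."""
    bytes_per_sample = bit_depth // 8
    return int(duration_seconds * sample_rate_hz * bytes_per_sample * channels)


# Hypothetical project: 200 recordings of 4 minutes each, CD-quality stereo.
recordings = 200
seconds_each = 4 * 60
total_bytes = recordings * pcm_size_bytes(seconds_each)

# Allow for a working copy plus one backup copy.
copies = 2
print(f"{total_bytes * copies / 1e9:.1f} GB")  # prints "16.9 GB"
```

Comparing a figure like this with the standard allocation for backed-up storage shows early on whether extra resources need to be requested in the plan.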

A minimal data management plan for a project using standard C4DM/QMUL facilities could say:

During the project, data will be created locally on researchers' machines and will be backed up to the QMUL network. Software will be managed through the code.soundsoftware.ac.uk site which provides a Mercurial version control system and issue tracking. At the end of the project, software will be published through soundsoftware and data will be published on the C4DM Research Data Repository.

For larger proposals, a more complete plan may be required. The Digital Curation Centre have an online tool (DMP Online) for creating data management plans which asks (many) questions related to RCUK principles and builds a long-form plan to match research council requirements.

It is important to review the data management plan during the project as it is likely that actual requirements will differ from initial estimates. Reviewing the data management plan against actual data use will allow you to assess whether additional resources are required before resourcing becomes a critical issue.

In order to create an appropriate data management plan, it is necessary to consider data management requirements during and after the project.

The Digital Curation Centre (DCC) provide DMP Online, a tool for creating data management plans. The tool can provide a data management questionnaire based on institutional and funder templates and produce a data management plan from the responses. Documents are available describing how to use DMP Online.

During The Research

During the course of a piece of research, data management is largely risk mitigation - it makes your research more robust and allows you to continue if something goes wrong.

The two main areas to consider are:
  • backing up research data - in case you lose, or corrupt, the main copy of your data;
  • documenting data - in case you need to return to it later.

In addition to the immediate benefits during research, applying good research data management practices makes it easier to manage your research data at the end of your research project.

We have identified three basic types of research projects, two quantitative (one based on new data, one based on a new algorithm) and one qualitative, and consider the data management techniques appropriate to those workflows. More complex research projects might require a combination of these techniques.

Quantitative research - New Data

For this use case, the research workflow involves:
  • creating a new dataset
  • testing outputs of existing algorithms on the dataset
  • publication of results
The new dataset might include:
  • selection or creation of underlying (audio) data (the actual audio might be in the dataset or the dataset might reference material - e.g. for copyright reasons)
  • creation of ground-truth annotations for the audio and the type of algorithm (e.g. chord sequences for chord estimation, onset times for onset detection)
Although the research is producing a single new dataset, the full set of research data involved includes:
  • software for the algorithms
  • the new dataset
  • identification of existing datasets against which results will be compared
  • results of applying the algorithms to the dataset
  • documentation of the testing methodology - e.g. method and algorithm parameters (including any default parameter values).

All of these should be documented and backed up.

Note that if existing algorithms have published results using the same existing datasets and methodology, then results should be directly comparable between the published results and the results for the new dataset. In this case, most of the methodology is already documented and only details specific to the new dataset need to be recorded separately.

If the testing is scripted, then the code used would be sufficient documentation during the research - readable documentation only being required at publication.
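
One way to make a test script document itself is to have it write the parameters it used, including defaults, next to the results. The sketch below assumes a hypothetical onset detector with made-up parameter names; the evaluation itself is a placeholder.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical default parameters for an onset detector.
DEFAULT_PARAMS = {"frame_size": 2048, "hop_size": 512, "threshold": 0.3}


def run_experiment(dataset_name, output_dir, **overrides):
    """Run a (placeholder) test and save the parameters used next to the results."""
    params = {**DEFAULT_PARAMS, **overrides}  # defaults are recorded too
    results = {"f_measure": None}  # placeholder: the real evaluation would go here

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    record = {"dataset": dataset_name, "parameters": params, "results": results}
    record_path = output_dir / "run.json"
    record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record_path


# A temporary directory stands in for the project's results folder.
with tempfile.TemporaryDirectory() as tmp:
    path = run_experiment("my_new_dataset", tmp, threshold=0.5)
    saved = json.loads(path.read_text())
    print(saved["parameters"])
```

Because the record is written by the same code that runs the test, it cannot drift out of step with what was actually executed, which is the property that makes the script sufficient documentation during the research.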

Quantitative research - New Algorithm

A common use-case in C4DM research is to run a newly-developed analysis algorithm on a set of audio examples and evaluate the algorithm by comparing its output with that of a human annotator. Results are then compared with published results using the same input data to determine whether the newly proposed approach makes any improvement on the state of the art.

Data involved includes:
  • software for the algorithm
  • an annotated dataset against which the algorithm can be tested
  • results of applying the new algorithm and competing algorithms to the dataset
  • documentation of the testing methodology

Note that if other algorithms have published results using the same dataset and methodology, then results should be directly comparable between the published results and the results for the new algorithm. In this case, most of the methodology is already documented and only details specific to the new algorithm (e.g. parameters) need to be recorded separately.

Also, if the testing is scripted, then the code used would be sufficient documentation during the research - readable documentation only being required at publication.

Qualitative research

An example would be using interviews with performers to evaluate a new instrument design.

The workflow is:
  • Gather data for the experiment (e.g. through interviews)
  • Analyse data
  • Publish data
Data involved might include:
  • the interface design
  • Captured audio from performances
  • Recorded interviews with performers (possibly audio or video)
  • Interview transcripts

Survey participants and interviewees retain copyright over their contributions unless they are specifically assigned to you! In order to have the freedom to publish the content a suitable rights waiver / transfer of copyright / clearance form / licence agreement should be signed. Or agreed on tape. Also, the people (or organisation) recording the event will have copyright on their materials... unless assigned/waived/licensed (e.g. video / photos / sound recordings). Most of this can be dealt with fairly informally for most research, but if you want to publish data then a more formal agreement is sensible. Rather than transferring copyright, an agreement to publish the (possibly edited) materials under a particular license might be appropriate.

Creators of materials (e.g. interviewees) always retain moral rights to their words: they have the right to be named as the author of their content; and they maintain the right to object to derogatory treatment of their material. Note that this means that in order to publish anonymised interviews, you should have an agreement that allows this.

If people are named in interviews (even if they're not the interviewee) then the Data Protection Act might be relevant.

The research might also involve:
  • Demographic details of participants
  • Identifiable participants (Data Protection)
  • Release forms for people taking part
and is likely to involve:

At The End Of The Research

Whether you have finished a research project or simply completed an identifiable unit of research (e.g. published a paper based on your research), you should look at:

Publication of the results of your research will require:
  • Summarising the results
  • Publishing a relevant sub-set of research data / summarised data to support your paper
  • Publishing the paper

Note that the EPSRC data management principles require sources of data to be referenced.

Research Management

The data management concerns of a PI will largely revolve around planning and appraisal of data management for research projects: to make sure that they conform with institutional policy and funder requirements; and to ensure that the data management needs of the research project are met.

A data management plan (e.g. for use in a grant proposal) will show that you have considered:
  • the costs of preserving your data;
  • funder requirements for data preservation and publication;
  • institutional data management policy
  • and ethical issues surrounding data management (e.g. data relating to human participants).
Specific areas to examine may include:

After the project is completed, an appraisal of how the data was managed should be carried out as part of the project's "lessons learned".

Data management training should provide an overview of all the above, and keep PIs informed of any changes in the above that affect data management requirements.

Data Management Skills

Archiving research data
Backing up
Documenting data
Managing software as data
Licensing research data
Publishing research data

Data Management Background

Research Council requirements
Relevant legislation

Data Management Motivation

Why manage research data ?

Available Resources

Resources available for C4DM researchers

Backing up

Why back up your data ?

How to back up data

The core principle is that backup copies of data should regularly be stored in a different location to the main copy.

Suitable locations for backups are:
  • A firesafe, preferably in a different building
  • A network copy
    • A network drive e.g. provided by the institution
    • Internet storage (in the cloud)
    • A data repository - this could be a public thematic / institutional repository for publishing completed research datasets, or an internal repository for archiving datasets during research
  • A portable device / portable media which you keep somewhere other than under your desk / with your laptop.

Backing up to external devices means that you need access to the device, whereas network drives and "internal" backups are usually more readily available. Aim for a routine, e.g. back up every time you're in the office / lab or at home.

The best backup is the one you do. The question of how often you need to back up depends very much on how much new data you've generated / how difficult it would be to recreate the data. For primary data (e.g. digital audio recordings of interviews) you should back them up as soon as possible as they may be very time consuming to redo. If an algorithm runs for days generating data files, you may want to set it up to also create backup copies as it proceeds rather than requiring backing up at the end of the processing. If you've changed some source code and can regenerate the data in an afternoon, you may not need to back up the data - but the source code should be safely stored in a version control system somewhere. If you feel too busy to back up your data, it may be a hint that you should make sure there's a copy somewhere safe!
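
For the long-running algorithm case, a sketch of "back up as you go" is shown below: each output file is copied to a second location as soon as it has been written. The file names and data are placeholders, and temporary directories stand in for the local disk and the backup location (e.g. a network drive).

```python
import shutil
import tempfile
from pathlib import Path


def save_with_backup(filename, data, work_dir, backup_dir):
    """Write one output file, then copy it straight to the backup location."""
    work_dir, backup_dir = Path(work_dir), Path(backup_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    backup_dir.mkdir(parents=True, exist_ok=True)

    target = work_dir / filename
    target.write_bytes(data)
    shutil.copy2(target, backup_dir / filename)  # copy2 also keeps timestamps
    return target


# Temporary directories stand in for a local disk and a network drive.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as backup:
    for step in range(3):
        # Placeholder for one stage of a long-running algorithm.
        save_with_backup(f"features_{step:03d}.dat", bytes([step]) * 10, work, backup)
    backed_up = sorted(p.name for p in Path(backup).iterdir())
    print(backed_up)
```
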

Remember that if you delete your local copy of the data then the primary copy will be the original backup... is that copy backed up anywhere ? If a network drive is used, it may be backed up to tape - but this should be checked with your IT provider.

Details of resources available for C4DM researchers are available here.

Can't I just put it in the cloud ?

You can, but the service agreement with the provider may give them a lot of rights... review the service agreement and decide whether you are happy with it!

Looking at service agreements in November 2012, we found that Google's terms let them use your data in any way which will improve their services - including publishing your data and creating derivative works. This is partly a side-effect of Google switching to a single set of terms for all their services. For Microsoft SkyDrive, the Windows Live services agreement is pretty similar.

Apple's iCloud is better as they restrict publication rights to data which you want to make public / share. Dropbox is relatively good - probably because they just provide storage and aren't mining it to use in all their other services!

Even so, there are issues. Data stored in the cloud is still stored somewhere... you just don't have control over where that location is. Your data may be stored in a country which gives the government the right to access data. Also, the firm that stores your data may still be required to comply with the laws of its home country when the data is stored elsewhere. It is, however, unlikely that digital audio research data will be sensitive enough to find this an issue.

A Forbes article, "Can European Firms Legally Use US Clouds To Store Data", stated that:

Both Amazon Web Services and Microsoft have recently acknowledged that they would comply with U.S. government requests to release data stored in their European clouds, even though those clouds are located outside of direct U.S. jurisdiction and would conflict with European laws.

If you are worried about what rights a service provider may have to your data in their cloud, then consider encrypting it - e.g. using an encrypted .dmg file on a Mac, or using TrueCrypt for a cross-platform solution. These create an encrypted "disc" in a file which you can mount and treat like a real disc - but all the content is encrypted. Note that changing data on an encrypted disc may change the entire contents of the disc file, so the whole disc may need to be resynced to the cloud storage. Alternatively, BoxCryptor or EncFS (also available for Windows) will encrypt individual files separately, allowing synchronisation to operate more effectively.

SpiderOak provide "zero knowledge" privacy in which all data is encrypted locally before being submitted to the cloud, and SpiderOak do not have a copy of your decryption key - i.e. they can't actually examine your data.

See JISC/DCC document "Curation In The Cloud" - http://tinyurl.com/8nogtmv

Surely there must be a quicker way...

Figuring out which files to copy can be very tedious, and usually leads to just backing up large chunks of data together. However, utilities can be used to copy just those files that have been updated - or even just update the parts of files that have changed.

The main command-line utility for this on UNIX-like systems (Mac OS X, Linux) is rsync. From the rsync man page:

Rsync is a fast and extraordinarily versatile file copying tool. It can copy locally, to/from another host over any remote shell, or to/from a remote rsync daemon. It offers a large number of options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.

Rsync finds files that need to be transferred using a "quick check" algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file's data does not need to be updated.
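
The "quick check" idea can be illustrated in a few lines of Python. This is a simplified imitation, not rsync itself: it copies whole files rather than using the delta-transfer algorithm, and the function names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def needs_copy(source, destination):
    """Quick check: copy if the destination is missing or differs in size or mtime."""
    if not destination.exists():
        return True
    src, dst = source.stat(), destination.stat()
    return src.st_size != dst.st_size or int(src.st_mtime) != int(dst.st_mtime)


def sync(source_dir, backup_dir):
    """Copy new or changed files from source_dir to backup_dir; return what was copied."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_dir)
        destination = backup_dir / relative
        if needs_copy(source, destination):
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, destination)  # copy2 preserves the modification time
            copied.append(str(relative))
    return copied


# Temporary directories stand in for the working data and the backup drive.
with tempfile.TemporaryDirectory() as data, tempfile.TemporaryDirectory() as backup:
    Path(data, "notes.txt").write_text("first draft")
    Path(data, "results.csv").write_text("onset,time\n")
    first_run = sync(data, backup)   # both files are new, so both are copied
    Path(data, "notes.txt").write_text("second draft, now longer")
    second_run = sync(data, backup)  # only the changed file is copied
    print(first_run, second_run)
```

For real backups, rsync or a tool built on it is the better choice; the sketch only shows why a second run is much quicker than the first.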

For Windows, there is a port of rsync, and DeltaCopy provides a GUI over rsync.

In addition, there are modern continuous backup programs (e.g. Apple's "Time Machine") which will synchronise data to a backup device and allow you to revert to any point in time. However, these solutions may not be appropriate if your data is large.

Version control systems for source code are optimised for storing plain text content and are not an appropriate way to store data unless the data is text (e.g. CSV files).

Archiving research data

For archival purposes data needs to be stored in a location which provides facilities for long-term preservation of data. As well as standard data management concerns (e.g. backup, documentation) the media and the file formats will need to be appropriate for long-term use.

Whereas work-in-progress data is expected to change regularly during the research process, archived data will change rarely, if at all. Archived data can therefore be stored on write-once media (e.g. CD-R).

In addition, it is not necessary to archive all intermediate results - reuse of archived data means that requiring a few days to regenerate results is reasonable. However, all necessary documentation, software and data should be archived to allow results to be recreated. Existing archived datasets will not need archiving "again". However, if the archiving system supports deduplication then storing multiple copies of the same content will require minimal additional storage.
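
Deduplication can be pictured as storing each file under a hash of its content, so that identical content is only ever written once. The sketch below uses SHA-256 from the Python standard library. The function name, file names and store layout are assumptions for illustration, not a description of any particular archiving system.

```python
import hashlib
import tempfile
from pathlib import Path


def archive_file(source, store_dir):
    """Store a file under the SHA-256 hash of its content; identical content is stored once."""
    source, store_dir = Path(source), Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    content = source.read_bytes()  # fine for a sketch; large files should be read in chunks
    digest = hashlib.sha256(content).hexdigest()
    stored = store_dir / digest
    if not stored.exists():
        stored.write_bytes(content)
    return digest


# Temporary directories stand in for the working folder and the archive store.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as store:
    Path(work, "take1.wav").write_bytes(b"same audio")
    Path(work, "take1_copy.wav").write_bytes(b"same audio")
    Path(work, "take2.wav").write_bytes(b"different audio")
    # The index maps original file names to content hashes.
    index = {p.name: archive_file(p, store) for p in sorted(Path(work).iterdir())}
    stored_count = len(list(Path(store).iterdir()))
    print(stored_count)  # prints 2: the duplicate content is stored only once
```
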

Once archived, the archive copy should not be modified directly and data access should only be required to create a new work-in-progress copy of the data to work from. Access to archived data will therefore be sporadic. Hence, it is possible to store archived data "off-line" only to be accessed when required.

It is important that archiving data is performed in an appropriate manner to allow future use of the data. This will require the use of appropriate formats for the data and storage on suitable media.

If the original content is not in an open format, then providing copies in multiple formats may be appropriate - e.g. an original Microsoft Word document, a PDF version to show how the document should look and the plain-text content so the document can be recreated.

Within C4DM, there are currently few resources available to support this. The best available option is the research group network folder as this is backed up to tape.

Archiving Data

BBC Domesday Project

1986 Project to do a modern-day Domesday book (early crowd-sourcing)
  • Used “BBC Master” computers with data on laserdisc
  • Collected 147,819 pages of text and 23,225 photos
  • Media expiring and obsolete technology put the data at risk!
Domesday Reloaded (2011)
To allow long-term access to data:
  • Don't use obscure formats!
  • Don't use obscure media!
  • Don't rely on technology being available!
  • Do keep original source material!

Google images for BBC Domesday

Media

Archive copies of data may be held on the same types of media as used during research. Additionally, write-once media (e.g. CD-R, DVD+/-R, BD-R) may be appropriate.

Removable drives (e.g. USB flash drives, FireWire HDD) may be used, but there is a risk of hardware failure with these devices - they are not "just" data storage.

Removable media (e.g. CD-R, tapes) do not have the risk of hardware failure but the media themselves may be damaged or become unusable - the estimated lifetime of an optical disc is 2-100 years. Whether a specific disc will last 2 years or 100 is not something that can easily be judged - although buying high quality media rather than cheap packs of 100 discs may help.

As with all technology, there is a risk of obsolescence:
  • devices to read removable media may no longer be commonplace (e.g. floppy disc drives, ZIP drives)
  • formats used for removable media may no longer be supported (e.g. various formats for DVD-RAM discs)
  • interfaces used for removable drives may no longer be commonplace (e.g. parallel or SCSI ports, PATA/IDE disc drives)

All media decay / become obsolete over time. It is therefore necessary to refresh the media by copying the data to new media at intervals. Doing this regularly reduces the risk of discovering that your archived data is inaccessible.

If data is stored on a RAID (Redundant Array of Independent Disks), then it is possible to replace an individual disk in the array and rebuild its content, thus refreshing the media.

Archived data is still at risk of data loss, and should be backed up somewhere else!

Archiving data is best supported through provision of a data archiving service (e.g. through a library). The burden of maintaining archival standards of storage for the media is then taken on by the service provider. This may appear to the user as a network drive, or as an archive system to which data packages may be submitted. Such a system may be part of a data management system which also supports publication of data.

File Formats

File formats also become obsolete. Although the original data should be archived, it is also recommended that copies of data are stored in more accessible formats, e.g. PDF outputs from LaTeX source, TIFF versions of images, FLAC copies of audio files. The more specific the source format, the stronger the requirement for readable formats! Closed formats (e.g. Microsoft Word documents) are particularly vulnerable to obsolescence - e.g. if you change the application you use from MS Word to OpenOffice, even if the document can be opened you may find that the formatting no longer works without purchasing MS Office.

  • LaTeX source - will all the required packages be available if you want to rebuild the document ?
  • Images - will the format be available ? is it a closed format (e.g. GIF) ?

If data is stored in lossy formats (e.g. MP3) then future decoders for that format may not produce precisely the same output (audio) as the decoder used in the initial experiments. A copy of the data should always include a lossless version of the data (e.g. PCM or FLAC for audio). Preferably, research should take place on lossless data extracted from the lossy files.

In the future, current audio formats may become obsolete. We therefore recommend that, when archiving audio files, copies of the data are stored in an open lossless format as well as in the original format. We currently recommend using FLAC to compress audio files - FLAC files use less space than the raw data and allow metadata tags to be included (e.g. artist and track name). If the use of compressed files is not appropriate, we recommend uncompressed PCM audio in WAV format.

Summary

Archiving data requires:
  • refreshing the media at suitable intervals by moving data onto new media
  • creating copies of the data in new formats to allow their use (e.g. converting data in closed formats to open formats, updating data to new versions of file formats).

Documenting data

What should you document ?

You should document the data so that people can understand it - what units the data is in, how the data was created, why the data was created and possible uses for the data.

As well as summary documentation for the entire dataset, individual data files should have their own documentation.

How to document data

  • Use a suitable directory structure. Documentation can then give a summary of all the files within a folder.
  • Use meaningful filenames
    • The more meaningful the better
    • However, they should be succinct
    • It may be necessary to refer to an explanation of the filenames to identify their content
    • Files may be moved from their original directory structure so filenames should be sufficient to identify a particular file
  • If documentation is required to understand file contents, copy the documentation when copying the files
  • Use standard file formats where possible - and preferably open formats so that files can be reused
  • Create README files with textual explanations of file content
  • Use the capabilities of file formats for self-documentation
    • If you have text files of data, consider including comment lines for explanations
    • Fill in author, title, date and comments for file formats that support them (e.g. PDF, Word .doc etc.)
    • Consider including <!-- --> comments in XML data
  • If data is created algorithmically / by code
    • Consider automatically writing out textual descriptions when the data is created
    • Document the values of all the parameters used to create the data
    • Remember to document the actual values of parameters for which default values were accepted - the default values might change with different versions of the code
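One way to record parameter values automatically, including defaults that were silently accepted, is sketched below in Python. The names describe_call, write_description and make_sine_params are hypothetical, and the JSON sidecar file is just one possible convention.

```python
import inspect
import json


def describe_call(func, *args, **kwargs):
    """Return every parameter value for a call, including accepted defaults."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in defaults the caller did not override
    return dict(bound.arguments)


def make_sine_params(frequency_hz=440.0, duration_s=1.0, sample_rate=44100):
    """Hypothetical data generator; only its parameters matter here."""
    return None


def write_description(path, func, *args, **kwargs):
    """Write a JSON sidecar file recording how the data was generated."""
    record = {
        "generator": func.__name__,
        "parameters": describe_call(func, *args, **kwargs),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
    return record
```

Calling write_description("tone.json", make_sine_params, 220.0) would record the duration and sample rate as well as the frequency, so the description stays accurate even if the defaults change in a later version of the code.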

Managing Software As Data

For existing software used in research, the appropriate citation, version and source should be documented. This may need to include versions of any libraries required by the software as changes to the libraries might affect the outputs.

For new software, as for data, the management issues are:

However, whereas data changes slowly / infrequently, software is subject to ongoing changes during a project. Source code for software usually consists of text files and should therefore be stored in a suitable version control system (e.g. Mercurial, Subversion, git). Binary releases of software may also be created as downloads for a project.

Additionally, software documentation has broader requirements - including both documentation to make the code maintainable (e.g. comments in the code, documenting APIs, Javadoc style documentation) and user documentation to explain how to install and use the software.

The Sound Software project provides software project management facilities for digital music and audio research including Mercurial version control, downloads, documentation, issue lists and wikis through its code repository

Other possible repositories for source code include:

The Sound Software project has information on choosing a version control system and provides a cross-platform, easy-to-use, graphical client for use with Mercurial.

Publishing research data

Research data publication allows your data to be reused by other researchers e.g. to validate your research or to carry out follow-on research. To that end, a suitable data publication host will allow your data to be discovered (e.g. by publishing metadata) and will be publicly accessible (i.e. on the internet).

Research data can be published on the internet through:
  • project web sites
  • research group web-sites
  • generic web archives (e.g. archive.org)
  • research data sites (e.g. figshare)
  • more general open access research hosts (e.g. f1000 Research)
  • thematic repositories dedicated to a specific discipline / subject area - sadly there is no sign of an appropriate repository for digital music and audio research
  • institutional repositories dedicated to research from a specific organisation (e.g. QMUL have a repository through which Green open access copies of papers by QM research staff can be published).
  • supplementary materials attached to journal articles

An appropriate license should be granted to allow other researchers to use your research data.

Within the Centre for Digital Music, we now have a research data repository for publishing research data outputs from the group. Publishing data through the C4DM repository gives a single point for publishing C4DM data on the internet without relying on (possibly ephemeral) project-specific web-sites. Other repositories that may be of interest to researchers are listed here.

If the web-site through which the data is published is also to be the long-term archive for your data, then you should check that it meets the criteria for an archival storage system. Note that although data will be written to the host irregularly, it is expected that published data will be accessed more frequently than archived data, making offline storage unsuitable.

If an external publisher is used for your research data, you should check the Terms and Conditions e.g. to see whether copyright on the data is transferred to the publisher and to check for how long they will publish your data.

If data is published through a publisher or repository, then it may also be held on institutional storage as long as the publisher's license is followed, which might e.g. require that there is a link back to the publisher from the institutional repository. Publishing under a Creative Commons license makes this easy.

If data is available in multiple places, different versions of the data might arise (e.g. changes between dates uploaded, data corruption). You should therefore make it easy to identify which specific version of the data is correct by publishing a digital fingerprint (e.g. an MD5 hash). MD5 fingerprints can be generated in Windows using MD5summer, in Linux with the GNU md5sum utility and on Mac OS X using md5 or openssl.
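For a platform-independent alternative, a fingerprint can also be computed with Python's standard hashlib module. The sketch below is a minimal example; the function name md5_fingerprint and the chunk size are arbitrary choices.

```python
import hashlib


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large audio files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hexadecimal string can be published alongside the data so that anyone downloading a copy can recompute the fingerprint and compare.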

Persistent IDs for data

In order to ensure ongoing access to your data, you should look to acquire a persistent ID for your dataset. However, persistence is a continuum with some IDs more persistent than others. DOIs and handles are designed to be persistent in the long term, allowing a unique identifier to be redirected to the current location of your dataset - if the dataset moves, the DOI/handle can be pointed at the new location. Repositories and research data sites may provide DOIs for data submitted to them. Institutional URLs may be persistent if the institution makes a policy decision to make them so. Other URLs may change when web-sites are revamped, making the published URL for your data return a "404 Not Found" message.

Persistent IDs are useful for referencing datasets, and are particularly handy if they are short. Long or ugly DOIs can be shortened using the ShortDOI service.

And more repositories

Repositories

The Digital Curation Centre have a (very short) list of repositories .

Repositories using DSpace can be registered on the DSpace web-site, for inclusion in the list of Who's using DSpace ? .

Within the University of London, the School of Advanced Study has a repository of humanities-related items.

University of the Arts London have an online repository

Edina provides a national data centre

EDINA is a UK national academic data centre, designated by JISC on behalf of UK funding bodies to support the activity of universities, colleges and research institutes in the UK, by delivering access to a range of online data services through a UK academic infrastructure, as well as supporting knowledge exchange and ICT capacity building, nationally and internationally.

Services hosted at EDINA include:

Pre-press e-Prints of articles can be published through http://arxiv.org/ and the related Computing Research Repository

Other repositories that may be of interest include:

NB: This list has been accumulated from various sites including:

Training the Trainers

I-Tech Training Toolkit

Performance Juxtaposition web-site:

ADDIE

http://www.learning-theories.com/addie-model.html

Kirkpatrick

Bloom

http://www.nwlink.com/~donclark/hrd/bloom.html

Cognitive, Affective and Psychomotor learning

Why do Data Management ?

Evidence Promoting Good Data Management

Data Reuse

Do you reuse other people's data ? Can they reuse yours ?

Researcher Development Framework

SCONUL Information Literacy 7 Pillars Diagrams

Licensing

Whose data is it anyway ?

QMUL HR Contract Terms and Conditions :

16. Patents & Copyright
a) Any discovery, design, computer software program or other work or invention which might reasonably be exploitable (‘Invention’) which is discovered, invented or created by the Employee (either alone or with any other person) either directly or indirectly in the course of their normal duties or in the course of duties specifically assigned to him in the course of his employment shall promptly be disclosed in writing to the College. All intellectual property rights in such Invention shall be the absolute property of the College and the College shall have the right to apply for, prosecute and obtain patent or other similar protection in its own name. Intellectual property rights include all patent rights, copyright and rights in respect of confidential information and know-how. The ownership of copyright in research papers, review articles and books will normally be waived by the College in favour of the author unless subject to any conditions placed on the works by the funder.

The important bit being...

Any ... work ... which might reasonably be exploitable ... which is ... created by the Employee ... in the course of duties ... in the course of his employment ... shall be the absolute property of the College

In the research contract, there is another clause:

The Employee will be expected to publish the results of his/her research work, subject to the conditions of any contract providing funding for the research

Therefore if funding bodies make funding contingent on publishing data as part of the results of research, then data publication will be allowed.

Research policies at QMUL Academic Registry and Council Secretariat

Creative Commons: http://wiki.creativecommons.org/Data CC Licenses / CC0

Science Commons: http://sciencecommons.org/projects/publishing/open-access-data-protocol/

Restrictions based on data ownership

Restrictions based on data parentage - use of e.g. CC-SA data

Article on CC-BY and data

Where possible, CC0 with a request for citations is preferred (Why does Dryad use CC0)

If data is based on copyright works it may be appropriate to restrict the license to allow only research / non-commercial use (e.g. this would prevent chord annotations being published commercially).

Practical Steps Towards Data Management

Even if you don't have a readily available data repository and your data can't be published, there are still steps you can take to manage it.

File formats - use open formats where possible to future-proof files.

File naming - give files meaningful names.

Metadata - include a plain-text README file describing the contents of the files.

License - include a plain-text LICENSE file describing the license for the dataset.
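A simple check along these lines can be automated. The Python sketch below reports which of the suggested README and LICENSE files are missing from a dataset folder; the function name and the decision to ignore case and file extensions are assumptions for illustration.

```python
import os

REQUIRED_FILES = ("README", "LICENSE")  # plain-text files suggested above


def missing_dataset_files(folder):
    """Return the required documentation files that are absent from folder.

    A file counts as present regardless of its case or extension,
    so "readme.txt" satisfies the README requirement.
    """
    present = {os.path.splitext(name)[0].upper() for name in os.listdir(folder)}
    return [name for name in REQUIRED_FILES if name not in present]
```

Running this before copying or sharing a dataset folder gives a quick reminder if the description or license has been left out.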

Check that a copy of your data will be backed up - e.g. check that the network drive you store your data on is actually backed up.

If you're really bothered about recovering your data make sure it's backed up off-site!

This could be (i) in the cloud (e.g. Dropbox); (ii) on a USB drive (hard/flash); (iii) at a specific network location (e.g. a NAS box at home).

Repositories

The appropriate repository will partly depend upon the data.

It could be... C4DM RDR, Dryad, Flickr, figshare, Archive.org...

However, if you want data to be reused in a citable manner remember to package the license and the required citation with the data. It means that however the data reaches the final user the only excuse for not being able to cite the data is that someone has bothered to remove the info...

Open Source Learning Tools

Xerte

Media to use in Training

Disk Drives Break

DataCent collection of disk drive failure sounds

Laptops Break / Get Broken

Legislation

JISC Web2 Rights
JISC Legal

There are three main areas of law affecting data management:

In addition, for data stored in the cloud, the USA PATRIOT Act may be relevant.

Copyright

Copyright grants the copyright holder rights relating to the use of the copyright material. In addition, certain moral rights are granted to the creator of the materials. Copyright is automatically granted when new creative material is produced - i.e. the material must be more than a simple collection of other data. Copyright is a separate item of property to the original work and the sale of the original work does not automatically pass copyright on to the new owner of that work (e.g. selling a score or painting does not automatically transfer the copyright). The particular rights and the duration of the copyright period are affected by the type of material.

For audio and digital music research, rights of particular interest relate to:
  • musical compositions and audio recordings - a CD can be covered by three separate copyrights, one for the design of the packaging, one for the sound recording on the CD and one for the musical composition recorded
  • typographical arrangements - these cover not only papers (which are also covered as literary works) but also the layout of spreadsheets and design of databases.

Pay The Piper has a very good post explaining music copyright, which includes:

If you compose a completely original piece of music then it is your own property - you own the copyright, in other words.

Arranging existing music is fraught with difficulties. To put it very simply (and this is indeed a gross simplification) until the composer has been dead for seventy years his music is copyright and you may not make a written arrangement of it without permission.

Lots more in the post though, so it's worth reading if you want to know more about music copyright!

It is important to note that copyright does not cover the ideas expressed within a work, only the particular form that that work has been captured in. The data within a spreadsheet is not copyright, only the particular layout of that data.

We note that simple anthologies - e.g. a collection of "complete works" or works created during a certain period - do not get copyright on the content, although the typographical layout may be copyright.

Fair dealing / fair use regulations allow specific uses of copies from original copyright materials (NB: not copies of copies!) without breaching copyright. However, fair use does not apply to sound recordings, films and broadcasts. There are JISC Guidelines for Fair Dealing in an Electronic Environment and specific clauses in the legislation on use in education and training or for personal study.

The legislation:

Moral Rights

The author of a work always retains two moral rights regarding the content:
  • The right to be identified as the author
  • The right to object to derogatory use of the material.

Database Rights

In the UK, if a "substantial investment" is made in "obtaining, verifying or presenting" the contents of a database then the database will be protected by database rights. The owner of those rights will be the person that "takes the initiative" in the creation of the database - that "person" being the employer if the database is made by an employee in the course of his work. Database rights are infringed by extraction or re-utilisation of a substantial part of the database.

Fair dealing rules exist for database rights - users of databases are allowed to extract data for non-commercial use in research and teaching (with acknowledgment of the source).

Database rights last for 15 years from the creation/publication of the database and may be renewed if the database changes substantially.

More information at: The act itself is at:

More Information

UK university materials regarding copyright and intellectual property: Further sources of information:

Some articles of interest from outside the UK

Australian IP law blog posts re. media and copyright. Includes:

US articles from Public Domain Sherpa Tutorial on Copyright and the Public Domain
  • What makes a derivative work
    derivative must use enough of the prior work that the average person would conclude that it had been based on or adapted from the prior work
  • Compilations
    compilations are copyright if they show minimal creativity (e.g. not just all works by someone or by date)
  • Copyright Renewal
    Many works did not have copyright renewed and therefore went out of copyright and into the public domain in the US - an estimated 15% of works had copyright renewed. Renewals will appear in the online US copyright database for works from 1950-1963.

CHM Super Sound (a South Pacific record company) state that :

A melodic phrase of a song is in copyright. The lyrics are in copyright. Chord progressions in a music composition however, are not copyright material.

University of Washington Copyright Connection

WIPO Understanding Copyright and Related Rights

Berne Convention for the Protection of Literary and Artistic Works

Chord Progressions and Copyright:

Data Protection

Data protection protects the rights of individuals over their personal information. In particular, The Data Protection Act covers the processing of data relating to identifiable living individuals. The core of the Data Protection Act is a set of data protection principles. These state that personal data shall be processed fairly and lawfully and shall not be processed unless the subject gave their consent except under specific conditions (for sensitive personal data such as marital status, ethnic origin or health information there are further restrictions). Fair and lawful processing requires that the data was not obtained by deception and is kept confidential and that the data subject was given information about who will process the data and for what purpose. In addition, personal data should be:
  • obtained only for specified purposes, and should not be used for anything else;
  • adequate, relevant and not excessive in relation to the purposes (i.e. only the data that is required);
  • accurate and, where necessary, kept up to date;
  • kept no longer than is necessary for the purposes;
  • processed in accordance with the rights of the data subjects under the Act;
  • protected from:
    • unauthorised or unlawful processing;
    • loss, destruction or damage;
  • not transferred outside the European Economic Area without similar protection being provided.

In general, data subjects have a right to access to data held about them. The onus to provide this data is on QMUL as the data controller, and, as such, QMUL should be able to find any personal data relating to identifiable living individuals which is held within the college.

However, there is a specific exemption, for research which is not targeted at particular individuals and will not cause distress or damage to a data subject, which allows data to be processed for other purposes and held indefinitely. Data subjects also have no immediate right of access for personal data where the data is processed for research purposes and the results do not identify the data subjects.

JISC state:

Data controllers are required by the Act to process personal data only where they have a clear purpose for doing so, and then only as necessitated by that purpose. A data controller’s purpose for any personal data processing operation should thus be clearly set out in advance of the processing, and should be readily demonstrable to data subjects.

They also note:
  • that the majority of the Data Protection principles do apply to research data;
  • that there should be a review to ensure compliance with Data Protection requirements;
  • that a mechanism should be in place for subjects to object to the processing if they believe it would cause them damage or distress;
  • and that particular care must still be taken when processing involves sensitive data.

As data protection applies to identifiable living individuals, it is generally best practice to anonymise any data relating to individuals as soon as possible, discarding any information that allows individuals to be identified. In order to comply with the Data Protection Act, a suitable consent form should be provided allowing the use of data relating to identifiable living individuals in research. Alternatively, such consent may be recorded in interviews. Within QMUL, research which involves human participants and data relating to them should be approved by the college Research Ethics Committee - the fast-track ethics review should be sufficient for most C4DM research.
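As a sketch of what discarding identifying information might look like in practice, the Python function below replaces identifying fields with an arbitrary participant ID. The field names (name, email) and the ID format are hypothetical, and removing such fields does not by itself guarantee that individuals cannot be identified from the remaining data.

```python
def anonymise(records, identifying_fields=("name", "email")):
    """Return copies of records with identifying fields replaced by an ID.

    The field names are hypothetical. Each distinct person gets a sequential
    participant ID; the mapping from people to IDs is deliberately not kept.
    """
    ids = {}
    cleaned = []
    for record in records:
        key = tuple(record.get(field) for field in identifying_fields)
        participant = ids.setdefault(key, "P%03d" % (len(ids) + 1))
        # Keep every field except the identifying ones
        kept = {k: v for k, v in record.items() if k not in identifying_fields}
        kept["participant"] = participant
        cleaned.append(kept)
    return cleaned
```

Responses from the same person keep the same participant ID, so results can still be grouped by participant after the names have been discarded.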

Further information: The Act:

Freedom Of Information

The Freedom Of Information Act (FoI) gives people the right to request data held by public bodies. It does not matter where the data originated, only who holds it. Copyright relating to information supplied under FoI requests remains unchanged - and provides you with protection from other people (mis)using your data.

The Freedom of Information Act states that research data:
  • can be held indefinitely;
  • is not subject to FoI requests unless individuals are identified in published research;
  • can be used for other research uses;
  • and may be exempt from FoI requests on grounds of (imminent) future publication or commercial interest.

Note that this means that if a researcher from another institution published research identifying individuals and you use their data, then individuals will have the right to request the data from QMUL.

Additionally, if data will be published through the college's normal publication scheme, then there is no onus on the college to provide the data under FoI requests - publishing data removes any additional requirements for FoI.

Further information: The Act:

USA PATRIOT Act

The 2001 USA PATRIOT Act provides the US government with the right to search/seize data held by any US company or its subsidiaries. It does not matter where the data is physically stored, if it is held by a US company (Microsoft, Apple, Google, DropBox, Amazon...) then the US government can seize the data. However, in order to do so it is necessary for the US government to obtain a court order for the purpose of an anti-terrorism investigation - they can't just idly decide to grab your data.

Note that these rights are not terribly different to the rights of other countries to access data (see Hogan Lovells' white paper).

Further information:

The Act:
  • 2001 "Uniting and Strengthening America by Providing Appropriate Tools Required to Intercept and Obstruct Terrorism" (USA PATRIOT) Act

Files

Research Council Requirements

Research councils are requiring data management plans as part of grant proposals and their policies also stipulate that research data created through their funding should be published for other researchers to use.

The DCC provides an overview of funders' data policies and individual pages for each funder's policy. The London School of Hygiene and Tropical Medicine (LSHTM) have also published a report on funder requirements for data preservation and publication.

The AHRC and EPSRC policies are most relevant to work at C4DM.

Arts and Humanities Research Council (AHRC)

From AHRC Funding Guide (PDF downloadable from AHRC web-site)

Deposit of resources or datasets
Grant Holders in all areas must make any significant electronic resources or datasets created as a result of research funded by the Council available in an accessible and appropriate depository for at least three years after the end of their grant. The choice of repository should be appropriate to the nature of the project and accessible to the targeted audiences for the material produced.
If you are a Grant Holder in the area of archaeology and decide to deposit with The Archaeology Data Service (ADS), then you should consult them at or before the start of the proposed research to discuss and agree the form and extent of electronic materials to be deposited with the ADS. If the deposit occurs after 31 March 2013, then there will be charge for this deposit.

Self Archiving
The AHRC requires that funded researchers:
• ensure deposit of a copy of any resultant articles published in journals or conference proceedings in appropriate repository
• wherever possible, ensure deposit of the bibliographical metadata relating to such articles, including a link to the publisher’s website, at or around the time of publication.
Full implementation of these requirements must be undertaken such that current copyright and licensing policies, for example, embargo periods and provisions limiting the use of deposited content to non-commercial purposes, are respected by authors.

The DCC provides a summary of AHRC policy.

Engineering and Physical Sciences Research Council (EPSRC)

The EPSRC data management principles state that:
  • research data should be made freely available with as few restrictions as possible
  • data with long term value should remain accessible and usable for future research
  • metadata should be made available to enable other researchers to understand the potential for further research and re-use of the data
  • data management policies and plans should exist for all data – and be adhered to!
  • published results should always include information on how to access the supporting data
  • all users of research data should acknowledge the sources of their data

The DCC provides a summary of EPSRC policy.

MUSHRA

(Wikipedia)

ITU-R Recommendation BS.1534-1

Frameworks for creating MUSHRA tests:
  • MUSHRAM - Matlab interface for MUSHRA audio tests
  • MUSHRA patcher for Max/MSP
  • mushraJS HTML5 and JavaScript based framework to create MUSHRA listening tests

Additional Notes

Data ownership issues - who owns your research data ?

Mapping to Vitae RDF

Specifics on: using the C4DM RDR and where data can safely be stored at QM

Paul Lamere's The Tools We Use

Bibliographic data