changeset 16:e5c387d04f6e

Various edits including trying to get more structure to the discovery of barriers, and adding TODO notes...
author Chris Cannam
date Sun, 25 Sep 2011 10:50:52 +0100
parents d166744ca3d8
children 56627b8fcf4d
files Makefile cannam.tex
diffstat 2 files changed, 108 insertions(+), 66 deletions(-)
--- a/Makefile	Fri Sep 23 18:04:56 2011 +0100
+++ b/Makefile	Sun Sep 25 10:50:52 2011 +0100
@@ -1,4 +1,4 @@
 all: cannam.pdf
 
 cannam.pdf: cannam.tex refs.bib
-	( echo q | pdflatex cannam ) && bibtex cannam && pdflatex cannam && pdflatex cannam
+	( echo q | xelatex cannam ) && bibtex cannam && xelatex cannam && xelatex cannam
--- a/cannam.tex	Fri Sep 23 18:04:56 2011 +0100
+++ b/cannam.tex	Sun Sep 25 10:50:52 2011 +0100
@@ -3,7 +3,7 @@
 \def\CC{{C\nolinebreak[4]\hspace{-.05em}\raisebox{.4ex}{\tiny\bf ++}}}
 \raggedbottom
 
-\title{Sound Software: Towards Sustainable and Reusable Software in Audio and Music Research}
+\title{Sound Software: Towards Reusable Software in Audio and Music Research}
 
 \name{Chris Cannam, Luis Figueira and Mark Plumbley}
 \address{Centre for Digital Music,\\
@@ -38,6 +38,11 @@
 need to produce and validate their work. We aim to teach the skills
 they need and to provide facilities they can use to make their lives
 easier and their research more sustainable.
+
+TODO: More of the abstract! Better!
+
+TODO: Replace all [citation needed] with citations!
+
 \end{abstract}
 %
 \begin{keywords}
@@ -49,9 +54,8 @@
 \label{sec:intro}
 
 It is widely understood in the audio and music research area that much
-research will involve both the development of new software, and
-evaluation of new methods against prior publications that also
-involved the use of software.
+research will involve both development of new software, and evaluation
+of methods against earlier work that also made use of software.
 
 This presents two areas of difficulty.  First, researchers in the
 audio and music research community --- including within the group
@@ -63,15 +67,17 @@
 have the skills or desire to become involved in traditional software
 development practice or in publication of code.
 
-Second, there are many technical and logistical reasons why software
-developed during earlier research is no longer available for
-evaluation and subsequent development even if it has been
-published. These include platform incompatibilities and obsolescence,
-or legal limitations on distribution or reuse.
+Second, there are technical and logistical reasons why software
+developed during earlier research becomes unavailable for subsequent
+use or development even if it has been published. These include
+platform incompatibilities and obsolescence, or legal limitations on
+distribution or reuse.
 
-In this paper we will discuss some of these practical constraints on
+In this paper we will discuss some of the practical constraints on
 application of reproducible research principles for software code, and
-explore an incremental approach toward better practice.  Finally, we will make some recommendations for [...]
+explore an incremental approach toward better practice.  Finally, we
+will make some recommendations for research groups towards improving
+software development practice in their work.
 
 \section{Reproducible Research}
 \label{sec:rr}
@@ -93,24 +99,36 @@
 subject both appeared in 2009~\cite{vandewalle2009}. The IEEE Signal
 Processing society now encourages Reproducible Research, allowing
 links from the online journal repository IEEEXplore to the code and
-data so that other researchers can reproduce the results.
+data associated with a publication (TODO: check this). 
 
 Actions such as these promote the idea that research results in signal
 processing should be presented not simply as a printed paper, but as a
 {\it compendium} [citation needed] including the paper, research data,
-and code.  Vandewalle et al have also created a Reproducible Research
-Repository\footnote{\tt http://rr.epfl.ch/}, designed to promote
-reproducible research by requiring the authors of a paper to upload
-the code and data used in the experiments. Readers can also comment on
-a publication and evaluate the reproducibility of the work.
+and code.  Vandewalle et al [citation needed] have also created a
+Reproducible Research Repository\footnote{\tt http://rr.epfl.ch/},
+designed to promote reproducible research by requiring the authors of
+a paper to upload the code and data used in the experiments. Readers
+can also comment on a publication and evaluate the reproducibility of
+the work.
 
-Although the Reproducible Research principle provides a comprehensive
+Although the Reproducible Research principle proposes a comprehensive
 solution to the problem of code dissemination, our experience has been
-that take-up in the audio and music research field is limited.
+that take-up in the audio and music research field is limited.  Why?
 
 \section{Understanding real-world limitations on software practice}
 \label{sec:researchsoft}
 
+[We are going to propose three barriers to reuse -- lack of education
+  \& confidence; lack of tools \& facilities; platform
+  incompatibilities and code rot. We need to give the facts and
+  figures supporting these as barriers and then identify them.]
+
+A study by Hannay et al \cite{gwilson2009} found a great deal of
+variation in the level of understanding of standard software
+engineering concepts by scientists, and found that for developing and
+using scientific software, informal self-study or learning from peers
+was commonplace.
+
 In order to better understand the reality faced by the audio and music
 research community, we conducted an online survey on software usage
 and development~\cite{ssamrsurvey}.  This survey opened in October
@@ -119,21 +137,35 @@
 software usage, authorship and publication practices of researchers,
 with the aim of obtaining a number of individual case points for
 further examination as well as some broad numerical results. The
-survey, closed in April 2011, with 54 complete and 23 partially
+survey closed in April 2011, with 54 complete and 23 partially
 complete responses. There were responses from at least 16 different
 institutions.
 
-The majority of our respondents said either that they took no steps to
-ensure reproducibility in their publications or that they only made
-code or data available on request.  Obstacles cited included lack of
-time, copyright restrictions, and the potential for commercial use of
-the code.
+Although 44\% of respondents said that they took steps to ensure
+reproducibility of their work, their accompanying comments suggested
+various interpretations of the meaning of reproducibility.  A common
+theme was that code would be made available on personal request; some
+respondents said that they documented code in order to be able to
+reproduce the results themselves, or that they were planning to
+publish software or data rather than having actually done so.
 
-A broader study into science research across several subject areas by
-the UK Research Information Network \cite{rin2010} also identified
-lack of evidence of benefits, cultures of independence and
-competition, and quality concerns as typical inhibiting factors for
-open sharing of data and code.
+Our respondents cited as obstacles to the publication of code lack of
+time, copyright restrictions, and the potential for future commercial
+use. A broader study into science research across several subject
+areas by the UK Research Information Network \cite{rin2010}
+additionally identified lack of evidence of benefits, cultures of
+independence and competition, and quality concerns as typical
+inhibiting factors for open sharing of data and code.
+
+Our survey found that most respondents kept code on their own machines
+and did not develop collaboratively. This is consistent with the
+Hannay study, which found that scientists typically developed and used
+software on their personal computers rather than dedicated servers
+(TODO: check this, does it say anything about sharing code?).
+
+[This suggests that there are cultural and technical [lack of
+    facilities \& awareness of how to use them] barriers... this makes
+  intuitive sense because of the following]
 
 Undertaking reproducible research takes effort early in the research
 cycle.  This happens before the benefits are necessarily apparent and
@@ -142,16 +174,8 @@
 have been produced and a paper written, there is little apparent
 incentive to make the research reproducible.
 
-A study in 2009 \cite{gwilson2009} found a great deal of variation in
-the level of understanding of standard software engineering concepts
-by scientists, and found that for developing and using scientific
-software, informal self-study or learning from peers was commonplace.
-
-The same study found that scientists typically developed and used
-software on their personal computers rather than dedicated servers,
-reflecting our own survey which found that most respondents kept code
-on their own machines and did not develop collaboratively
-\cite{ssamrsurvey}.
+[Furthermore there is a barrier because of platform incompatibilities
+  as follows]
 
 In many of the fields within this community, researchers lack the
 skills or desire to write their own code or to make someone else's
@@ -204,8 +228,10 @@
 ultimate concern of our present work, therefore, is sustainability and
 reusability rather than reproducibility.
 
+[TODO: reword this following reword to sec \ref{sec:researchsoft}]
+
 We cannot address all possible barriers to software publication and
-reuse, but following section \ref{subsec:researchsoft} we identify
+reuse, but following section \ref{sec:researchsoft} we identify
 that we may be able to help in overcoming: lack of confidence in code
 quality and of comfort with collaborative development; lack of
 facilities and tools to support such development; and reusability
@@ -224,8 +250,9 @@
 materials~\cite{softwarecarpentry}.  This week-long residential course
 for 20 audio and music researchers from groups around the UK taught
 fundamentals of software development and good practice including
-version control, unit testing and test-driven development, Python
-syntax and structure, and managing experimental datasets with sqlite.
+version control for software, unit testing and test-driven
+development, Python syntax and structure, and managing experimental
+datasets with sqlite.
 
 \subsubsection{Videos and Tutorials}
 
@@ -236,23 +263,22 @@
 we say about this?)
 
 \subsection{Barrier: Lack of facilities and tools}
+\label{sec:lackoffacilities}
 
 Researchers will not make use of version control and collaborative
-development facilities if they are unaware that they exist.  An
-informal poll of attendees at the Autumn School (section
-\ref{sec:autumnschool}) showed that few of them were aware of such
-facilities being provided by their institutions, and those attendees
-who tried to discover after the course what facilities their
-institution could provide all reported failure.  This is consistent
-with the experience in our own group, where version control has been
-used sporadically and set up in an ad-hoc fashion, and also with
-feedback provided to our survey.
+development facilities that are not available to them, or of whose
+existence they are not aware.  An informal poll of attendees at the
+Autumn School (section \ref{sec:autumnschool}) showed that few of them
+were aware of such facilities being provided by their institutions.
+This is consistent with the feedback to our survey and with experience
+in our own group, where version control has been used sporadically and
+set up in an ad-hoc fashion.
 
 Attendees at the Autumn School also reported difficulty getting
 started with the complex user interfaces available for version
 control.  Nonetheless, version control was identified by attendees in
 debriefing as the most compelling subject taught during the course,
-suggesting that lack of awareness may be the main barrier to uptake.
+suggesting that lack of awareness may be the main barrier to uptake. (TODO: link to this)
 
 \subsubsection{SoundSoftware Code Site}
 \label{sec:codesite}
@@ -264,26 +290,42 @@
 is unable to help them or if they have a need to work with researchers
 at other institutions who would not be permitted access to their
 institution's facilities.  The existence of this site also addresses
-the shortcomings in our own group's former ad-hoc use of version
-control.  The site is implemented using our own custom version of the
-Redmine\footnote{\tt http://redmine.org/} project management
-application, with Mercurial version control.  Any UK researcher in the
-field can register and start their own collaborative projects using
-the version control, wiki, issue tracking, and other services
-provided.
+shortcomings in our own group's use of version control mentioned in
+section \ref{sec:lackoffacilities}.  The site is implemented using our
+own custom version of the Redmine\footnote{\tt http://redmine.org/}
+project management application, with Mercurial version control.  Any
+UK researcher in the field can register and start their own
+collaborative projects using the version control, wiki, issue
+tracking, and other services provided.
 
 Four aspects of our code site contribute to sustainability and utility
 for researchers:
 
+% Figures as of 24th Sept 2011:
+%
+% 118 projects, of which
+%  * 82 (69%) are top-level projects and 36 (31%) are subprojects
+%  * 51 projects (43%) and 34 top-level projects (41%) are public,
+%    67 projects (57%) and 48 top-level projects (59%) are private
+%
+% 88 users, of whom 46 have qmul in their email addresses (but note
+% that some QM researchers use non-QM addresses)
+%
+% average of 2.14 members per project (dividing number of rows in
+% members column with number in projects column), with averages of
+% 2.35 for public projects and 1.97 for private ones
+
 \begin{enumerate}
 \item {\em Focus} --- The focus of the site on audio and music
-  research makes it easier to locate and obtain code for reuse;
+  research may make it easier to locate and obtain code for reuse.
 \item {\em Linking publications with code} --- Users can associate
   publication records with their projects so that readers can
-  immediately see what publications are relevant;
-\item {\em Public and private projects} --- Researchers can use the
-  site for code management even for projects they do not intend to
-  publish, or can start a project privately and open it later;
+  immediately see what publications are related to the code.
+\item {\em Public and private projects} --- Projects can be entirely
+  public, or private to a group of collaborating researchers; work can
+  also be started privately and made public later.  At the time of
+  writing, 57\% of projects hosted at the site are private, and the
+  average number of members in private projects is 1.97.
 \item {\em Tracking external projects} --- Researchers who use code
   hosting or project management facilities elsewhere can also make use
   of our site as a nexus for relevant projects, as the site does not