CodeHosting » History » Version 5

Chris Cannam, 2010-09-20 04:19 PM

h1. The code hosting problem
h2. Assumptions found in my head
* Audio and music research groups in institutions *lack effective access to version control systems*
  * This is certainly historically true of C4DM; what about other groups?
* Researchers often want to *share their code selectively* with other researchers in the same field but in other institutions
  * Internal code hosting doesn't usually facilitate this
* Individual researchers may be happy to host their code in *existing public hosting services* (e.g. SourceForge, Google Code), but their supervisors are likely to be less keen
  * Supervisors don't necessarily appreciate these services' requirement that everything should be open source, and scattered external hosting makes it hard for them to keep track of what work their students are producing
  * The opposite dynamic may occur in some places -- researchers may be self-conscious about publishing code even when their supervisors encourage them to
_How can we test these assumptions?_
_If these assumptions are correct, how do we solve these problems?_
h3. We could encourage and train institutions to provide better internal code hosting facilities
For example, by providing recipes, templates, support, etc. for setting up well-featured, friendly services.  A good code management facility would bring together a version control system with a nice web front-end, project data sharing facilities (a wiki, etc.), and a sensible authentication system that doesn't involve a whole new username/password database.
This is likely to improve code development practice in an institution that has no facility at present. But it doesn't really solve the "selective sharing" problem, or help very much with the desire to move toward publication of software and reproducible research -- unless we can also convince people to make their own internal hosting facility a public one.
Audio and music research groups are typically too small to run their own facilities successfully.  To do this well, they really need a horizontal approach -- facilities provided to all research areas by a common CS or IT service.  This isn't necessarily the most effective approach if we want to improve search for, and access to, audio-related research code specifically, but it may be the easiest approach to maintain.  This is (presumably?) the sort of thing that the general Software Sustainability Institute (SSI) ought to be exercising itself with.
Some institutions will have a central system already.  How many?  Which?  Are they happy with it?  Can the SSI guess at any of these figures for us?  Would the existence of a working, if not ideal, internal facility make a group less likely to accept any other approach that we might propose?
h3. We could encourage institutions to make use of existing external facilities
Researchers are often already familiar with services like Google Code, SourceForge, GitHub, etc., and in some cases may use them even for hosting code that is not really supposed to be published ("yet") if they need to share it with one or more individuals at other institutions.  If they are comfortable with doing that, why not encourage it -- since it also promotes open publication and has little or no maintenance cost?
However, these sites generally don't address the requirement for private hosting of projects that are "not yet" ready for publication.  (GitHub and Bitbucket support private hosting, but GitHub requires payment for it and Bitbucket's free facility is very limited.)  It might be attractive to persuade groups that their code should all be public from the start, but that's not very realistic.  In any case, it's probably not wise during advocacy to mix up a technical solution to a practical problem (use of version control during development) with promotion of a philosophical position (code should be published).
Also, keeping track of projects in these external facilities is hard -- both for prospective users and reusers who want to find code, and for institutions that want to keep track of the work their researchers are producing.  We could perhaps help out by providing indexing and metadata services for projects through a central location.
All that said, these services work -- we don't want to find ourselves proposing methods that will be less attractive to motivated researchers.
h3. We could provide a dedicated facility
We could provide a new code hosting facility that offers private hosting and access control, so that in theory institutions could treat it as an internal facility with the ability to "promote" their projects to public status when desired.
This could solve the problem of selective sharing and the problem of maintaining private code.  It could also store and provide more effective project metadata -- e.g. associate a project with its publications, or list all projects from a particular research lab -- making it easier to find, index, and consequently reuse project code.
But it does have some difficulties:
* Supervisors and other decision-makers would need to be reassured that they were not likely to be duped into publishing work they wanted to keep private
* Supervisors and other decision-makers would need to be reassured (through policy? through technical means?) that they were not simply giving away their institution's assets _to whoever was running the service_
* Researchers would need to perceive the facility as being at least as easy to use and effective as any of the existing external services
* The service would require an ongoing maintenance budget separate from any individual institutional budget (the self-sustainability problem)
* Consequently, all users would need some sort of reassurance that they wouldn't lose all of their code and project metadata if the funding ran out
And other practical risks:
* It's possible that everyone would just create private projects, add a few selected other users, and never make them public at all
* There is a step of "due diligence" that people generally undertake when preparing to publish something (adding license headers and README files, checking who the code actually belongs to, etc.), which may get overlooked if the code starts out as private, since users may be more inclined to check in any code they're working with, of whatever provenance. This also makes it more likely that the project will never become public, especially since the history will remain even if the code is subsequently cleaned up
h3. Hybrid approaches
We may be able to combine the second and third approaches by encouraging people to use "whatever hosting facility suits -- and here's one of our own if you can't find one" and then also providing indexing, cross-references and metadata for external projects.  How hard is this to do, technically?
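
As a rough way of thinking about that question, here is a minimal sketch in Python of what the indexing side might look like. It assumes that each project, wherever it is hosted, can be described by a small metadata record. The names @ProjectRecord@ and @ProjectIndex@, the field names, and the in-memory storage are all hypothetical and chosen only for illustration; a real service would also need persistence, authentication, and some way of keeping records up to date.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectRecord:
    """Metadata for one project, wherever its code is actually hosted.

    All field names here are illustrative assumptions, not a proposed schema.
    """
    name: str
    repo_url: str  # location of the code on any hosting service
    lab: str       # research group responsible for the project
    publications: list = field(default_factory=list)  # titles or identifiers


class ProjectIndex:
    """In-memory index supporting the two lookups mentioned in the text."""

    def __init__(self):
        self._records = {}

    def add(self, record):
        # A later record with the same name replaces the earlier one.
        self._records[record.name] = record

    def by_lab(self, lab):
        """Names of all indexed projects from a particular research lab."""
        return sorted(r.name for r in self._records.values() if r.lab == lab)

    def by_publication(self, publication):
        """Names of all indexed projects associated with a publication."""
        return sorted(
            r.name for r in self._records.values()
            if publication in r.publications
        )
```

Under these assumptions, cross-referencing an externally hosted project only requires someone to register its URL, lab, and publications; the index never needs access to the code itself. The harder parts are probably social rather than technical: getting records created at all, and keeping them current.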
Is there any way to combine the first approach with any other?  Presumably, many institutions will have their own hosting facility, and advisory services like the SSI will quite reasonably be encouraging them to set one up.  Does having, and using, an internal service in fact risk making it harder to publish and share code?