Mar 10 2013

I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and have had a bit of a complain about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.

Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.

Hilmar Lapp posted on G+ on the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post, and the policies are listed with links to the journal editorials.

The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.

So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has (aspirations) to have a life too!

What I would like to see

My thoughts aren’t very novel, others have put forward many of these ideas:

1. A publisher driven version of the Bioinformatics Archive

I would like to see bioinformatics journals taking a lead on this. Not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so it can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomed literature decay with time (PMID 17238638), and the same is certainly true for links to software. Use of the standard code repositories is what we should expect from authors, just as we expect submission of sequence data to a standard repository, not hosting on the authors’ website.

I think there is great merit to using a GitHub public repository owned by a consortium of publishers and maybe also academic community representatives. Discuss. An advantage of using a version control host like GitHub is that it would apply not-too-subtle pressure to host code rather than just the binary.

2. Redundancy to ensure persistence in the worst case scenario

Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository was owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium, unfortunately it doesn’t exist) then authors could not delete the repository. Sure they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up to date code repository could be included in the archived publication snapshot via a wiki page, or README document.

Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although re-inventing the wheel for code would be suboptimal. I would like to see the FigShare and DataDryad guys enter the discussion and offer advice, since they are experts at data archiving.
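As a rough illustration of the fork-plus-mirror idea above, here is a minimal Python sketch that builds the two git commands needed to take a full snapshot of an author's repository and push it to a second, archive-owned remote. The URLs and the `mirror_commands` / `run_mirror` names are hypothetical, invented for this example; `git clone --mirror` and `git push --mirror` are standard git options. It defaults to a dry run that only prints the commands.

```python
import subprocess


def mirror_commands(source_url, archive_url, workdir):
    """Return the git commands that copy every branch and tag of
    source_url into archive_url via a bare mirror clone in workdir."""
    return [
        ["git", "clone", "--mirror", source_url, workdir],
        ["git", "-C", workdir, "push", "--mirror", archive_url],
    ]


def run_mirror(source_url, archive_url, workdir, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    commands = mirror_commands(source_url, archive_url, workdir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failure
    return commands


if __name__ == "__main__":
    # Hypothetical URLs: replace with the author's repository and the
    # archive (or mirror host) remote.
    run_mirror(
        "https://example.org/author/tool.git",
        "https://example.org/archive/tool.git",
        "tool-snapshot.git",
    )
```

Running the same function a second time with a different `archive_url` would give the redundant copy on another host.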

3. The community to initiate actual action

A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think maybe an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? Though if anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little if any monetary cost to implementing this. It wouldn’t take long.

But can competing journals really be organised like this? Yes, absolutely for sure, there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional by comparison to journals not thinking along these lines. Researchers will see strict archiving policy as a reason to trust publications in those journals as more than just ephemeral vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.

So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d like discussions with the relevant journals ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.


Nov 17 2012

Mendeley, Zotero, Papers, CiteULike, and others are all playing the social reference management game. You store your PDFs in their excellent programmes and then you can start to be social: form groups, browse subject categories, subscribe to other people’s reference lists. My question is this: is the current implementation the best way forward for us, the users, or is it driven by the interests and/or old-fashioned thinking of the providers? For those of you just looking for the punchline, I think the answer is the latter: best for them, not us. We should be learning lessons from social networks, and the open-science movement, and adopting open standards before some Facebook-like behemoth emerges and dominates social scientific reference management for the foreseeable future.

Social Reference Management

Reading scientific literature is at the heart of all science, and whoever you are there is always a little paranoia that you haven’t read the latest important thing in your area. The social web has a lot to contribute here. When I find an interesting reference I click to add it to my (say) Mendeley library, and subsequently post it to a shared group folder such as “Gene Duplication” or “Stuff Dave is Reading”. Other people can then follow that collection from within Mendeley, but outside of the Mendeley ecosystem it becomes much more difficult. I’m not picking on Mendeley here, they have done a lot of hard work in creating some excellent software, and they are a commercial company after all. But wouldn’t it be better for me, and for science, if I could subscribe to lots of people on lots of networks without barriers?

Social Networks

It’s almost impossible to consider this topic without introducing analogies to social networks like Google+, Facebook, and Twitter. In these areas, however, there has been a lot of talk about privacy, data ownership, and freedom of connectivity. If I am on Facebook I can interact with other Facebook users just fine, but how can I bring in someone on a different social network? I can’t. I’m locked in and able to operate only within the environment designed by Facebook. Similarly there are issues of data ownership: Facebook has famously caused some concern regarding who owns the photos that are uploaded. But there is another side to data ownership, that is not the ownership of the data itself but of the data’s organisation. I might be reluctant to leave a social network, not because I can’t keep a list of friends, but because I have dozens of people partitioned into different groups, organised so that I can follow work-related contacts in a different way from drinking buddies.

Open, Distributed, Semantic Social Networks

The web discussion of social networks has led in one very promising direction: away from locked-in “data silos” towards a consideration of open standards where different social networks speak the same language and you can communicate across networks. In addition there are far fewer concerns about what rights to the content you post you are giving away if that content stays on your server, or otherwise in your control. Similarly, if we rely on the export tools provided by a service provider, they are always likely to be poor. Why invest time developing tools to make it easier to leave your network? Sure you have to have an export option, otherwise you will get bad press, but the rabble won’t rouse up across the web just because your export is mediocre and loses much of the organisational content people have invested in. And it’s not like people can fix this themselves, because you control the software environment. There are many open distributed social networks including Diaspora, OneSocialWeb, GNU social and Friendica, although they have struggled to make significant inroads in terms of user base and some are no longer in active development. What if we used their code for open scientific bibliographic social networking?

Open Social Reference Management is important

I would like an open standard social system of reference sharing. This could be done within the providers’ current systems; it might not even look that different to the present, and it could certainly be built to match the look and feel of whatever reference software system you are in. But the important difference is that you could follow a group, or recommend reading to anyone, no matter what system they were using.
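To make the idea a little more concrete, here is a minimal Python sketch of what a portable shared collection might look like: plain JSON keyed on DOIs, keeping the tags so that the organisation travels with the references. The field names (`collection`, `owner`, `references`, `tags`) and the DOI shown are made up for illustration; this is not an existing standard or any provider's export format.

```python
import json


def export_collection(name, owner, references):
    """Serialise a shared reading list to JSON that any client could read.

    Each reference is a dict with 'doi', 'title' and optional 'tags'.
    """
    doc = {
        "collection": name,
        "owner": owner,
        "references": [
            {
                "doi": ref["doi"],
                "title": ref["title"],
                "tags": sorted(ref.get("tags", [])),  # keep the organisation
            }
            for ref in references
        ],
    }
    return json.dumps(doc, indent=2, sort_keys=True)


def import_collection(text):
    """Read a collection exported by any client using the same layout."""
    return json.loads(text)


if __name__ == "__main__":
    # Hypothetical entry; the DOI is a placeholder, not a real paper.
    shared = export_collection(
        "Gene Duplication",
        "dave",
        [{"doi": "10.0000/example", "title": "An example paper", "tags": ["to-read"]}],
    )
    print(shared)
```

A group followed from Mendeley, Zotero, or anything else would then just be a feed of documents like this one.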

Yes, this would be difficult, but not very difficult. The open licences of some of the existing social software could be challenging for commercial ventures to include (though perhaps not Zotero, which is open source). Linking the reference to the PDF could again be challenging, especially if your business model is based on selling user storage. There are many other things that could also be challenging, but hey, really big advances are always difficult. I currently find a lot of references on Twitter and G+, but they are disconnected from my library of literature; sure, they are social, but nothing more.

Hyperbole alert: Science is built on knowledge. Reading that paper, which sparks the idea or makes the link that eventually produces (name of the coolest advance you can think of). The group that builds the infrastructure that truly links scientists’ reading and social knowledge-sharing across the world will save lives, and protect biodiversity, and build rockets to Mars. What could be cooler, or more important than that? And all we need is Diaspora* for journal articles, to link Mendeley to Zotero to CiteULike to Papers. How hard is that, really?


This is worth reading: “A flock of twitters: decentralized semantic microblogging”

Jul 22 2011

For those of you who haven’t come across it before, Bio-Linux is an operating system set up for bioinformatics with a huge number of programs pre-installed. It can be obtained (for free) from the NERC Environmental Bioinformatics Centre. I’ve spent quite a while recently messing with installations of software packages and wanted to see how everything would work in a pre-installed environment. You can obtain a USB drive from NERC and boot from that, but it doesn’t work for OSX. Also, I wasn’t sure that I wanted to reboot each time, as I may need to flip backwards and forwards between applications in Bio-Linux and OSX. Here I document a few experiments with installing and running Bio-Linux within OSX (so I don’t have to re-boot) using VirtualBox.

Here are a few choice quotes about Bio-Linux

Bio-Linux 6 packs a wealth of bioinformatics tools, scientific software and documentation into a powerful and user-friendly 64-bit Ubuntu Linux system. Download Bio-Linux today and turn your PC into a powerful workstation in minutes.

Bio-Linux 6.0 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 10.04 base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs. You can also install Bio-Linux packages to handle new generation sequence data types.

FYI: I’m running OSX 10.6.8 (Snow Leopard) on a Mac Pro with 4GB RAM and 2x 2.8GHz Quad-Core Intel Xeon processors. The list below is going to take >1 hour.

Here’s what I did to install

  1. Download and install VirtualBox.
  2. Download Bio-Linux 6 (2.2 GB). Since this is free, supported software paid for by the UK taxpayer, it would be really great for NBAF-W if you registered so that they can say ‘X people have downloaded this software’. Also please cite the paper (Field et al 2006) when you can.
  3. Open VirtualBox and click “New” from the toolbar. Follow the installation Wizard.
  4. Give your virtual machine a name like “BioLinux”, choose Linux as the operating system, and select Ubuntu 64 bit as the version.
  5. Select the amount of RAM to give it: 1024MB should be OK; the default of 512MB could be a bit mean. More RAM is always better, especially if you are going to set it to do a lot of hard work. This can always be changed later.
  6. Virtual Hard Disk- use the defaults (create new), and again on the next screen (VDI).
  7. Virtual disk storage details. “Dynamically allocated” is the default and I used this first time out. I suspect that it was the cause of slowness though and changed to “Fixed size” next time through. Certainly if you go for Dynamically allocated make sure to give it enough space on the following screen.
  8. VD file location and size- I used 8GB and Dynamic first time through and it was immediately short of space after I did a system update. I would definitely choose 16GB if you have the space on your HD. When I compared the two this 16GB fixed size felt much faster.
  9. The next screen is a summary and now you can press “Create” to create your virtual disk. If you have chosen “Fixed size” it will take a little while to create this virtual disk (5-10 mins) but will likely run faster in the future. At the end of the process you come back to exactly the same summary screen as at the start, with no indication that anything has happened. If you press the “Create” button again though it immediately updates to show you your new virtual disk in the VirtualBox Manager window.
  10. You can now press the Green Start arrow in the toolbar to launch it. You will now get a “First Run Wizard”.
  11. Select Installation Media. Now is the time to select the operating system that you specified in step 4, ie point it towards your download of BioLinux. If you click on the little folder icon to the right of the drop-down menu you can select your BioLinux file. Use the dropdown in the file list window to select “RAW (*.iso *.cdr)” as your BioLinux is an .iso file. Check your downloads folder to locate it. At this point it is very easy (I did it 4 times across 2 installs) to click on something that causes the screen to freeze and bleep whenever you click on anything. The Esc key solved this for me. Be careful where you click! When you have selected the file you should be back at the Select Installation Media dialog with bio-linux-6-latest.iso now selected. Continue.
  12. The next screen claims that you are installing your file from CD/DVD, ignore that, you know the truth. Click Start.
  13. You should now get an Ubuntu window and wait a couple of minutes before it boots and you see the BioLinux desktop and the install window.
  14. Choose your language and “Install Bio-Linux 6” at the bottom. Don’t click on “Try Bio-Linux”. Then select time zone.
  15. Keyboard layout “Choose your own” then select “United Kingdom Macintosh” from the right panel.
  16. Accept the defaults, then add your name and password. I set it to log in automatically here.
  17. Now click INSTALL. Almost done. It will take a few minutes to install, go and have a coffee.
  18. “Installation complete- you need to restart the computer.” This refers only to the virtual computer. Restart. “Please remove the disc and close the tray (if any) then press enter”. This is because the software still thinks you are installing from a DVD. Ignore it, and press enter, Ubuntu Bio-Linux will boot.
Congratulations, all done, you are ready to go off and play with it.
There might be a few things you want to do in this new operating system.
  • You should probably set a network proxy: System –> Preferences –> Network proxy. Similarly you might want to use the “Ignored hosts” tab to exclude your university domain “*” in my case
  • You might want to update the system software. System –> Administration –> Update manager.
  • You might want to go to the VirtualBox Manager window and click on Shared Folders. Then add a folder from your HD where you want to keep data accessible to both operating systems. I set mine to Auto-mount when I log in. I don’t think this works until you have restarted Bio-Linux.
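For anyone who prefers the command line to the wizard, the VirtualBox side of the steps above (create the machine, give it RAM, make a fixed-size disk, attach the ISO, start it) can be sketched with VBoxManage. The Python below only builds and prints the commands. The machine name, the ISO filename, and the "SATA" controller name are assumptions taken from or chosen for this walkthrough, and the subcommands are as I recall them from the VBoxManage manual (older VirtualBox releases used `createhd` rather than `createmedium`), so check `VBoxManage --help` for your version before running anything. The Bio-Linux installer steps inside the guest still have to be done by hand.

```python
def vboxmanage_commands(name="BioLinux", iso="bio-linux-6-latest.iso",
                        ram_mb=1024, disk_mb=16384):
    """Build the VBoxManage commands for a 64-bit Ubuntu guest."""
    disk = name + ".vdi"
    return [
        # Name, OS type and version chosen in the wizard
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Ubuntu_64", "--register"],
        # RAM for the guest
        ["VBoxManage", "modifyvm", name, "--memory", str(ram_mb)],
        # Fixed-size VDI disk (size is given in MB)
        ["VBoxManage", "createmedium", "disk", "--filename", disk,
         "--size", str(disk_mb), "--variant", "Fixed"],
        # A storage controller to hang the disk and the ISO on
        ["VBoxManage", "storagectl", name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", disk],
        # The downloaded ISO presented as a virtual DVD
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "1", "--device", "0", "--type", "dvddrive", "--medium", iso],
        ["VBoxManage", "startvm", name],
    ]


if __name__ == "__main__":
    for cmd in vboxmanage_commands():
        print(" ".join(cmd))
```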


You may also find a preconfigured VirtualBox BioLinux image, but at the time I wrote this it wasn’t the latest version (v5). It might be worth checking.
Many thanks to Steve Moss who introduced me to VirtualBox, helped me install this, and showed me some useful stuff.
Vested interest? I am on the NERC Biomolecular Analysis Facility (NBAF) steering committee, which has a role in oversight of NBAF-W who created Bio-Linux. I don’t feel in any way biased by this, but hey, you decide.
Jul 31 2008

I came across a nice program by Heroen Verbruggen called TreeGradients.

“TreeGradients is a tree drawing program. The tree drawing options are fairly basic but the program has the ability to plot several types of continuous variables at the nodes in colors and use linear color gradients to fill the branches between nodes. The output format is SVG (scalable vector graphics), which can be imported in most vectorial drawing software.”

It looks like Heroen is particularly interested in plotting continuous variables across trees. The part that immediately interested me was the ability to colour internal nodes by bootstrap (or Bayesian) support. In the example on the website, poor support is shown as pale grey, along a gradient to strong support as black. When dealing with very large trees this is a nice visual trick to focus the mind on areas that are well supported and away from poorly supported areas (by making these less visible). Colours and presence of numeric bootstrap values can be adjusted to taste. The program is actually a pair of Perl scripts distributed under an open-source GNU General Public Licence. I want to congratulate the author for making these open-source.
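The grey-to-black trick is easy to sketch. The Python below is not TreeGradients' code (that is Perl, and I haven't read it); it is just a minimal linear mapping from a support value in the range 0 to 100 onto an SVG-style hex colour, with the palest grey level (`pale=200`) chosen arbitrarily.

```python
def support_to_grey(support, pale=200):
    """Map support (0-100) to a hex grey: 0 -> pale grey, 100 -> black."""
    s = max(0.0, min(100.0, float(support)))  # clamp out-of-range values
    level = round(pale * (1 - s / 100))       # linear gradient
    return "#{0:02x}{0:02x}{0:02x}".format(level)


if __name__ == "__main__":
    for value in (0, 50, 75, 100):
        print(value, support_to_grey(value))
```

Posterior probabilities on a 0 to 1 scale would need multiplying by 100 first.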

I haven’t actually tried it out yet but thought I’d flag it up now rather than my usual habit of waiting and waiting until I could review it properly (and my backlog is running at about 6 months now).

Jun 27 2008

The script I referred to in my last post was actually written by Olaf Bininda-Emonds, with a few minor modifications to send the output directly to phyml. I thought I would flag up his site, which has a large and very useful collection of Perl scripts for phylogenetic data wrangling. These are open-source scripts and I frequently find myself using and modifying these programs. Thanks Olaf!

Mar 13 2008

Following on from my previous post, I decided to try Google Maps as an interface to large phylogenetic trees. This was a very quick and dirty go at seeing whether it would work as a navigable interface. I tried the implementation at MapLib, which allows you to upload your own images and use Google Maps to explore them. So I uploaded some PNG images generated from big ARB trees. It worked quite well. Unfortunately there are some restrictions on the size of image that can be uploaded to this site, so a thorough test of zooming about huge trees wasn’t really possible. But the image here is a screenshot of a smallish tree.
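To get a feel for why image size matters, here is a small Python sketch of the arithmetic behind a generic zoomable tile pyramid of the kind map viewers use: the image is cut into square tiles and each zoom level halves the resolution. The 256-pixel tile edge is the usual map-tile size but is an assumption here, as are the function names; I don't know how MapLib actually tiles uploads.

```python
import math

TILE = 256  # assumed tile edge in pixels


def max_zoom(width, height, tile=TILE):
    """Smallest zoom level at which the full-resolution image fits
    inside a square of tile * 2**zoom pixels."""
    longest = max(width, height)
    return max(0, math.ceil(math.log2(longest / tile)))


def tiles_at_zoom(width, height, zoom, tile=TILE):
    """Number of (columns, rows) of tiles needed at a given zoom,
    with the image at full resolution at max_zoom."""
    scale = 2 ** (zoom - max_zoom(width, height, tile))
    cols = math.ceil(width * scale / tile)
    rows = math.ceil(height * scale / tile)
    return cols, rows


if __name__ == "__main__":
    w, h = 4096, 2048  # a modest tree image
    top = max_zoom(w, h)
    for z in range(top + 1):
        print(z, tiles_at_zoom(w, h, z))
```

Each doubling of the image's longest side adds a zoom level and roughly quadruples the tiles at the deepest level, which is presumably why upload limits bite quickly for really big trees.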

So, I realize that this doesn’t meet many of my own suggestions for getting information from large trees but it does have some interesting possibilities as a simple browser with a good user interface.

Mar 12 2008

It seems that Genome Projector has swept the blogosphere over the last 24 hours. I’ve seen it listed on many of the blogs I’m reading. It looks very good and intuitive. I just wanted to mention a couple of things.
First, this is a beautiful example of what happens when open source software is championed. Google Maps has essentially been reworked as a genome browser. Beautiful.

“Development API is available for Google Map View! Any image (in almost any format including GIF, PNG, JPEG, BMP and even SVG) of any size can be readily converted to zoomable image using generateGMap() API distributed within the G-language Genome Analysis Environment with open-source GNU General Public License.”

Second, is there any future in the Google API for tree viewing? BioPerl will quickly convert any Newick file to SVG. It is clearly mature code. It includes a little inset to see where you are in the genome (tree). Works in most browsers. Zoom and move is the basis of most tree navigation. Search and mark locations (taxa) are already available. It might be worth a look.