Mar 10 2013
 

I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and I have complained about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.

Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.

Hilmar Lapp posted on G+ on the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post here: http://blog.datadryad.org/2011/01/14/journals-implement-data-archiving-policy/ and the policies with links to the journal editorials here http://datadryad.org/pages/jdap

The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.

So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has (aspirations) to have a life too!

What I would like to see

My thoughts aren’t very novel, others have put forward many of these ideas:

1. A publisher driven version of the Bioinformatics Archive

I would like to see bioinformatics journals taking a lead on this. Not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so that it can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomed literature decay with time (17238638) and the same is certainly true for links to software. Use of the standard code repositories is what we should expect of authors, just as we expect submission of sequence data to a standard repository, not hosting on the authors’ website.

I think there is great merit to using a GitHub public repository owned by a consortium of publishers and maybe also academic community representatives. Discuss. An advantage of using a version-controlled host like GitHub is that it would apply not-too-subtle pressure to host code rather than just the binary.

2. Redundancy to ensure persistence in the worst case scenario

Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository was owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium, unfortunately it doesn’t exist) then authors could not delete the repository. Sure they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up to date code repository could be included in the archived publication snapshot via a wiki page, or README document.

Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although re-inventing the wheel for code would be suboptimal. I would like to see the FigShare and DataDryad guys enter the discussion and offer advice, since they are experts at data archiving.
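As a rough illustration of how little is technically involved, here is a minimal Python sketch of snapshotting an author's repository into an archive and a second mirror. The function names and all URLs are made up for the example (the consortium doesn't exist, remember), and it assumes the empty archive repositories have already been created on each host. It only prints the commands unless `dry_run` is switched off.

```python
import subprocess


def mirror_commands(source_url, archive_urls, workdir="snapshot.git"):
    """Build git commands that copy every branch and tag to each archive."""
    # --mirror clones all refs (branches, tags), not just the default branch.
    commands = [["git", "clone", "--mirror", source_url, workdir]]
    for url in archive_urls:
        # push --mirror makes the remote an exact copy of the local refs.
        commands.append(["git", "-C", workdir, "push", "--mirror", url])
    return commands


def run(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    run(mirror_commands(
        "https://github.com/example-author/example-tool.git",
        [
            "https://github.com/example-consortium/example-tool.git",
            "https://bitbucket.org/example-consortium/example-tool.git",
        ],
    ))
```

Because the authors have no rights on the archive copies, deleting their own repository would leave both mirrors untouched.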

3. The community to initiate actual action

A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think maybe an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? Though if anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little if any monetary cost to implementing this. It wouldn’t take long.

But can competing journals really be organised like this? Yes, absolutely for sure; there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional by comparison to journals not thinking along these lines. Researchers will see strict archiving policy as a reason to trust publications in those journals as more than just ephemeral vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.

So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d like to see discussions with the relevant journals ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.


Jul 22 2011
 

For those of you who haven’t come across it before, Bio-Linux is an operating system set up for bioinformatics with a huge number of programs pre-installed. It can be obtained (for free) from the NERC Environmental Bioinformatics Centre. I’ve spent quite a while recently messing with installations of software packages and wanted to see how everything would work in a pre-installed environment. You can obtain a USB drive from NERC and boot from that, but it doesn’t work for OSX. Also, I wasn’t sure that I wanted to reboot each time, as I may need to flip backwards and forwards between applications in Bio-Linux and OSX. Here I document a few experiments with installing and running Bio-Linux within OSX (so I don’t have to reboot) using VirtualBox.

Here are a few choice quotes about Bio-Linux

Bio-Linux 6 packs a wealth of bioinformatics tools, scientific software and documentation into a powerful and user-friendly 64-bit Ubuntu Linux system. Download Bio-Linux today and turn your PC into a powerful workstation in minutes.

Bio-Linux 6.0 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 10.04 base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs. You can also install Bio-Linux packages to handle new generation sequence data types.

FYI: I’m running OSX 10.6.8 (Snow Leopard) on a MacPro with 4GB RAM and 2x 2.8GHz Quad-Core Intel Xeon processors. The list below is going to take >1 hour.

Here’s what I did to install

  1. Download and install VirtualBox from http://www.virtualbox.org/wiki/Downloads
  2. Download Bio-Linux 6 (2.2 GB) from http://nebc.nerc.ac.uk/tools/bio-linux/bl_download. Since this is free, supported software paid for by the UK taxpayer, it would be really great for NBAF-W if you registered so that they can say ‘X people have downloaded this software’. Also please cite the paper (Field et al 2006) when you can.
  3. Open VirtualBox and click “New” from the toolbar. Follow the installation Wizard.
  4. Give your virtual machine a name like “BioLinux”, choose Linux as the operating system, and select Ubuntu 64 bit as the version.
  5. Select the amount of RAM to give it: 1024MB should be OK; the default of 512MB could be a bit mean. More RAM is always better, especially if you are going to set it to do a lot of hard work. This can always be changed later.
  6. Virtual Hard Disk- use the defaults (create new), and again on the next screen (VDI).
  7. Virtual disk storage details. “Dynamically allocated” is the default and I used this first time out. I suspect that it was the cause of slowness though and changed to “Fixed size” next time through. Certainly if you go for Dynamically allocated make sure to give it enough space on the following screen.
  8. VD file location and size- I used 8GB and Dynamic first time through and it was immediately short of space after I did a system update. I would definitely choose 16GB if you have the space on your HD. When I compared the two this 16GB fixed size felt much faster.
  9. The next screen is a summary and now you can press “Create” to create your virtual disk. If you have chosen “Fixed size” it will take a little while to create this virtual disk (5-10 mins) but will likely run faster in the future. At the end of the process you come back to exactly the same summary screen as at the start, with no indication that anything has happened. If you press the “Create” button again though it immediately updates to show you your new virtual disk in the VirtualBox Manager window.
  10. You can now press the Green Start arrow in the toolbar to launch it. You will now get a “First Run Wizard”.
  11. Select Installation Media. Now is the time to select the operating system that you specified in step 4, ie point it towards your download of BioLinux. If you click on the little folder icon to the right of the drop-down menu you can select your BioLinux file. Use the dropdown in the file list window to select “RAW (*.iso *.cdr)” as your BioLinux is an .iso file. Check your downloads folder to locate it. At this point it is very easy (I did it 4 times across 2 installs) to click on something that causes the screen to freeze and bleep whenever you click on anything. The Esc key solved this for me. Be careful where you click! When you have selected the file you should be back at the Select Installation Media dialog with bio-linux-6-latest.iso now selected. Continue.
  12. The next screen claims that you are installing your file from CD/DVD, ignore that, you know the truth. Click Start.
  13. You should now get an Ubuntu window and wait a couple of minutes before it boots and you see the BioLinux desktop and the install window.
  14. Choose your language and “Install Bio-Linux 6” at the bottom. Don’t click on “Try Bio-Linux”. Then select time zone.
  15. Keyboard layout “Choose your own” then select “United Kingdom Macintosh” from the right panel.
  16. Accept the defaults, then add your name and password. I set it to log in automatically here.
  17. Now click INSTALL. Almost done. It will take a few minutes to install, go and have a coffee.
  18. “Installation complete- you need to restart the computer.” This refers only to the virtual computer. Restart. “Please remove the disc and close the tray (if any) then press enter”. This is because the software still thinks you are installing from a DVD. Ignore it, and press enter, Ubuntu Bio-Linux will boot.
Congratulations, all done, you are ready to go off and play with it.
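The wizard steps above (roughly 4 to 12) could probably also be scripted with VirtualBox's `VBoxManage` command-line tool. What follows is an untested sketch, not what was actually run for this post: the function name, the "SATA" controller name and the file paths are placeholders, and newer VirtualBox releases call the `createhd` subcommand `createmedium`. The Python wrapper just builds the command list so it can be inspected before anything is executed.

```python
import subprocess


def biolinux_vm_commands(name, iso_path, disk_path, ram_mb=1024, disk_mb=16384):
    """Build VBoxManage commands mirroring the wizard choices described above."""
    ctl = "SATA"  # arbitrary controller name chosen for this sketch
    return [
        # Step 4: name the machine and pick 64-bit Ubuntu.
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Ubuntu_64", "--register"],
        # Step 5: RAM in MB.
        ["VBoxManage", "modifyvm", name, "--memory", str(ram_mb)],
        # Steps 6-9: a fixed-size VDI disk (size in MB).
        ["VBoxManage", "createhd", "--filename", disk_path,
         "--size", str(disk_mb), "--variant", "Fixed"],
        ["VBoxManage", "storagectl", name, "--name", ctl, "--add", "sata"],
        ["VBoxManage", "storageattach", name, "--storagectl", ctl,
         "--port", "0", "--device", "0", "--type", "hdd",
         "--medium", disk_path],
        # Step 11: attach the downloaded .iso as a virtual DVD.
        ["VBoxManage", "storageattach", name, "--storagectl", ctl,
         "--port", "1", "--device", "0", "--type", "dvddrive",
         "--medium", iso_path],
        # Step 10: boot the machine.
        ["VBoxManage", "startvm", name],
    ]


if __name__ == "__main__":
    for cmd in biolinux_vm_commands(
            "BioLinux", "bio-linux-6-latest.iso", "BioLinux.vdi"):
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

The Bio-Linux installer itself (steps 13 onwards) still has to be clicked through inside the virtual machine.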
There might be a few things you want to do in this new operating system.
  • You should probably set a network proxy: System –> Preferences –> Network proxy. Similarly, you might want to use the “Ignored hosts” tab to exclude your university domain (“*.hull.ac.uk” in my case).
  • You might want to update the system software: System –> Administration –> Update manager.
  • You might want to go to the VirtualBox Manager window and click on Shared Folders. Then add a folder from your HD where you want to keep data accessible to both operating systems. I set mine to Auto-mount when I log in. I don’t think this works until you have restarted Bio-Linux.

Notes

You may also find a preconfigured VirtualBox BioLinux image, but at the time I wrote this the image was v5 rather than the latest version. It might be worth checking.
Many thanks to Steve Moss who introduced me to VirtualBox, helped me install this, and showed me some useful stuff.
Vested interest? I am on the NERC Biomolecular Analysis Facility (NBAF) steering committee, which has a role in oversight of NBAF-W who created Bio-Linux. I don’t feel in any way biased by this, but hey, you decide.
Jun 25 2008
 

I’m not a very competent perl programmer. Even writing the word programmer here makes me slightly embarrassed. I do carry out frequent sequence conversions and manipulations with perl scripts I’ve put together though. When I need to run a script many times, I’ve found the most irritating thing is launching the script and pointing it towards the right input file. A much simpler option in this case is to save the script as an application and drop the files onto it to carry out the conversion. I’ve come across two options for doing this (all this is very Mac-centric I’m afraid, but I’d be interested to see MS equivalents in the comments).
The first is the open-source program Platypus by Sveinbjorn Thordarson that “can be used to create native, flawlessly integrated Mac OS X applications from interpreted scripts such as shell scripts or Perl and Python programs”. Make sure that the “is droppable” check box is selected. I found it quite straightforward to turn scripts into droppable applications this way. As it says on the site, but it needs some remembering, you will need to modify your script slightly to accept the infile correctly. The basic tutorial page says the following

Enabling “Is droppable” for an app will modify the property list for the app in question so that it can receive dropped files in the Dock and Finder. These files are then passed on to the script as arguments via @ARGV. However, the first argument to the script ($ARGV[1], $1 etc., depending on your scripting language of choice) is always the path to the application bundle (for example “/Applications/MyPlatypusApp.app”).

Essentially this means that (in perl at least) where your input file would be identified right at the start by $ARGV[0], it should be changed to $ARGV[1] before creating your application, because the application bundle path now takes the first slot.
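Here is a minimal sketch of the same adjustment in Python (which Platypus also supports), assuming the behaviour described in the quoted tutorial: the first argument is the application bundle path and the dropped files follow it. Note that Python keeps the script itself in `sys.argv[0]`, so the bundle path lands in `sys.argv[1]`. The function name `dropped_files` and the `.app` check are just illustrative choices, not part of Platypus.

```python
import sys


def dropped_files(argv):
    """Return the dropped file paths from a Platypus-style argument list.

    Assumes (per the tutorial quoted above) that the first argument after
    the script name is the application bundle path; it is skipped if present.
    """
    args = list(argv[1:])  # argv[0] is the script itself in Python
    if args and args[0].rstrip("/").endswith(".app"):
        args = args[1:]  # drop the application bundle path
    return args


if __name__ == "__main__":
    for path in dropped_files(sys.argv):
        print("would convert:", path)
```

Checking for the `.app` suffix rather than always skipping one argument means the same script still behaves when launched from the command line, at the cost of misreading a dropped file that is itself an application bundle.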
Another interesting aspect is the ability to bundle in code files referred to in your script. This means for example that if you have a script that depends on bioperl, it needn’t break, just add in the path to the parts of bioperl needed.

The second option is an AppleScript droppable application. I have to admit that I have never written an AppleScript, but I came across this post recently from TUAW outlining the “do script” command. AppleScripts can be saved as droplet applications onto which you drop input files. A bit of Googling reveals people using both do script “script.pl” and do shell script “script.pl”. The last seems a bit odd since script.pl is a perl not shell script, but it looks like either will work.
As an example I once created a perl script that took an alignment in a range of formats and converted to a format acceptable to phyml, then ran the program using standard settings of my choice. I have this on my desktop as a droppable application called “runPhyml”. It works very nicely for generating quick trees.
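For a flavour of what such a script does, here is a much-reduced Python sketch. It is not the original perl script: it assumes the input is an aligned FASTA file only, writes a simple sequential PHYLIP-style file of the kind phyml is generally happy with, and builds (but does not run) a phyml command for nucleotide data. All function names are invented for the example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the taxon name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Format aligned records as sequential PHYLIP text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all the same length")
    lines = ["%d %d" % (len(records), lengths.pop())]
    lines += ["%s  %s" % (name, seq) for name, seq in records]
    return "\n".join(lines) + "\n"


def phyml_command(phylip_path):
    """Command for a nucleotide run; other settings left at phyml defaults."""
    return ["phyml", "-i", phylip_path, "-d", "nt"]


if __name__ == "__main__":
    example = ">taxonA first\nACGT\n>taxonB\nAC\nGT\n"
    print(to_phylip(read_fasta(example)), end="")
    print(" ".join(phyml_command("alignment.phy")))
```

Wrapped in a droppable application, the conversion would be applied to each dropped file and the command handed to something like `subprocess.run`.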