Mar 10 2013

Error 404

I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and had a bit of a complain about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.

Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.

Hilmar Lapp posted on G+ about the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post, and the policies, with links to the journal editorials, are available there too.

The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.

So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has aspirations to have a life too!
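Forking of this kind could in principle be scripted against the GitHub REST API (`POST /repos/{owner}/{repo}/forks`, which accepts an `organization` field). Here is a hypothetical sketch: the repository names are placeholders, and routing forks into a community organization assumes the archive exists as a GitHub organization with a token that has fork rights.

```python
# Hypothetical sketch: fork a paper's repository into a community archive
# organization via the GitHub REST API. Repo names and the organization
# are placeholders, not an existing service.
import json
import os
import urllib.request

API = "https://api.github.com"

def fork_request(owner, repo, organization="BioinformaticsArchive"):
    """Build (but do not send) the fork request for owner/repo."""
    url = f"{API}/repos/{owner}/{repo}/forks"
    body = json.dumps({"organization": organization}).encode()
    return url, body

def fork(owner, repo, token, organization="BioinformaticsArchive"):
    """Send the fork request; requires a token with permission to fork."""
    url, body = fork_request(owner, repo, organization)
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", f"token {token}")
    req.add_header("Accept", "application/vnd.github+json")
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    url, _ = fork_request("some-author", "some-tool")
    print(url)
    token = os.environ.get("GITHUB_TOKEN")
    if token:  # only touch the network when a token is actually supplied
        fork("some-author", "some-tool", token)
```

Run over a list of repositories harvested from abstracts, a loop like this is all the automation the archiving step needs.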

What I would like to see

My thoughts aren’t very novel, others have put forward many of these ideas:

1. A publisher driven version of the Bioinformatics Archive

I would like to see bioinformatics journals taking a lead on this: not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at the time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so it can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomedical literature decay with time (PMID 17238638), and the same is certainly true for links to software. Use of standard code repositories is what we should expect of authors, just as we expect submission of sequence data to a standard repository rather than hosting on the authors’ own website.

I think there is great merit to using a GitHub public repository owned by a consortium of publishers, and maybe also academic community representatives. Discuss. An advantage of using a version-control platform like GitHub is that it would apply not-too-subtle pressure to host code rather than just the binary.

2. Redundancy to ensure persistence in the worst case scenario

Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository was owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium, unfortunately it doesn’t exist) then authors could not delete the repository. Sure they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up to date code repository could be included in the archived publication snapshot via a wiki page, or README document.

Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although re-inventing the wheel for code would be suboptimal. I would like to see the FigShare and DataDryad teams enter the discussion and offer advice, since they are experts at data archiving.
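Mirroring for redundancy is mechanically cheap with Git. The sketch below demonstrates `git clone --mirror` entirely locally: the source repository is a throwaway stand-in created on the spot, but with real GitHub and BitBucket URLs the commands are identical.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Stand-in for the author's published repository (a real GitHub URL
# would normally go here; everything below is local for illustration)
git init -q source
( cd source
  echo "# snapshot at publication" > README
  git add README
  git -c user.name=archive -c user.email=archive@example.invalid \
      commit -qm "snapshot at publication" )

# A mirror clone copies every branch and tag; pushing the whole mirror
# to a second host is then a single command:
#   git -C mirror.git push --mirror <bitbucket-url>
git clone -q --mirror source mirror.git

src=$(git -C source rev-parse HEAD)
dst=$(git -C mirror.git rev-parse HEAD)
[ "$src" = "$dst" ] && echo "mirror OK"
```

Because `--mirror` tracks all refs, re-running the push keeps the second copy in sync without any per-branch bookkeeping.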

3. The community to initiate actual action

A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? And if anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little, if any, monetary cost to implementing this. It wouldn’t take long.

But can competing journals really be organised like this? Yes, absolutely, for sure: there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional by comparison to journals not thinking along these lines. Researchers will see a strict archiving policy as a reason to trust publications in those journals as more than just ephemeral vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.

So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d start discussions with the relevant journals ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.


Aug 27 2012

I’ve been reading a lot recently about reproducible research (RR) in bioinformatics, on several blogs and on Google+ and Twitter. The idea is that it is important that someone is easily able to reproduce* your results (and even figures) from your publication using your provided code and data. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research: Reproducible Research in Phylogenetics, or RRphylo.

The current state of affairs

The problem is that although the Methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, it would be a very time-consuming process: lots of hands-on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looked like the one published, and that’s not really good enough.

Why it matters

I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described in his blog. I have had experience of this too, but that’s another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important, you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork, this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.

What should be done?

Briefly: pipelines, workflows, and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, BioPython), or DendroPy, or Hal. This would go from sequence download, manipulation and alignment (though this could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step, and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than starting from an alignment file.
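As a concrete sketch of the kind of script-based workflow meant here, the fragment below chains alignment and tree building as explicit commands. The file names and the MUSCLE/PhyML invocations are assumptions for illustration; substitute whatever tools and parameters you actually use — the point is that every setting is written down in one rerunnable place.

```python
# Sketch of a minimal script-based phylogenetics workflow: every step and
# every parameter is recorded, so rerunning the script reproduces the
# analysis. Tool names and flags are assumptions; adapt to your own setup.
import shutil
import subprocess

def muscle_cmd(infile, outfile):
    # MUSCLE v3-style invocation; all alignment parameters live here
    return ["muscle", "-in", infile, "-out", outfile]

def phyml_cmd(alignment, model="GTR", bootstrap=100):
    # PhyML invocation; substitution model and support settings are explicit
    return ["phyml", "--input", alignment, "--model", model,
            "--bootstrap", str(bootstrap)]

def run_pipeline(fasta="seqs.fasta", aln="seqs.aln"):
    steps = [muscle_cmd(fasta, aln), phyml_cmd(aln)]
    for cmd in steps:
        print(" ".join(cmd))          # the pipeline documents itself
        if shutil.which(cmd[0]):      # only run if the tool is installed
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_pipeline()
```

A sequence-retrieval step (e.g. via BioPython’s Entrez module) would slot in before the alignment command in the same way.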

Phylogeneticists are sometimes much less comfortable with scripting than bioinformaticians are, but this has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).

There is a certain attraction to GUI approaches though, and GUI workflow builders may well increase in popularity. The figure above is from Armadillo, which looks interesting but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.

At the moment a script-based approach linking together analyses is the best option for RRphylo.

‘So you’ve been doing this, right?’

Err, no, I’ve been bad, really I have. Some of my publications are very poor from the RRphylo point of view. The reasons for that would take too long to go into, and I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need and the challenges of implementation. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of an analysis is how good phylogenetics is done, and that is best implemented in an automated way.

Automated Methods sections for manuscripts?

One interesting idea, both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from GenBank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe this itself in a human-readable text format suitable for your manuscript.

The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……

This is a brief, made-up Methods section that doesn’t contain all the detail it could. Parameter values from the script have been inserted in blue. This sort of output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB it is not this Methods text that needs to be distributed; it is the script that carried out these analyses (and produced this human-readable summary as an extra output) that is the real record to distribute.
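Mechanically, a pipeline can emit such a paragraph simply by keeping its parameters in one place and rendering them through a template. A minimal sketch, reusing the made-up values from the example above (the dictionary keys and template wording are my own assumptions, not an existing tool):

```python
# Sketch: the analysis script keeps its parameters in one dictionary and
# renders a first-draft Methods paragraph from them. The values are the
# made-up ones from the example in the text.
PARAMS = {
    "database": "NCBI nucleotide database",
    "date": "27 August 2012",
    "query": "cichlidae[ORGN] AND d-loop[TITL] AND 300:1600[SLEN]",
    "n_seqs": 5611,
    "aligner": "MUSCLE",
    "aligner_version": "v1.2.3",
}

TEMPLATE = (
    "Sequences were downloaded from the {database} ({date}) with the "
    "Entrez query {query}. These {n_seqs:,} sequences were aligned with "
    "{aligner} {aligner_version}."
)

def methods_paragraph(params):
    """Render a human-readable Methods draft from the pipeline parameters."""
    return TEMPLATE.format(**params)

if __name__ == "__main__":
    print(methods_paragraph(PARAMS))
```

Because the same dictionary drives both the analysis commands and the text, the Methods draft cannot silently drift away from what was actually run.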


This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old Methods section they always have: a general description of what they did that people in the field will understand. I went today to read a lot of Methods sections from phylogeny papers. Some were better than others at describing the important details, but none sounded relevant to the new era of large-scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.

It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.

The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing the easy GUI linking of phylogenetic modules and publication of standard-format workflows to MyExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user guides allowing even moderately computer-literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other workflow tools worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing, so I’d be really interested in other suggestions and views on these options.

I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible- of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?

See also 

Our approach to replication in computational science, C Titus Brown

Reproducible Research: A Bioinformatics Case Study, Robert Gentleman, 2004

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nekrutenko & Taylor, Nature Reviews Genetics 13, 667–672 (September 2012) | doi:10.1038/nrg3305 (subscription required)

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goecks et al., Genome Biology 2010, 11:R86 | doi:10.1186/gb-2010-11-8-r86 (open access)

Reproducible research with bein, BiocodersHub

Sumatra also talks about pipelines acting as an Electronic Lab Book; here’s a presentation about it.

* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.

Jul 26 2011

I have a project going at the moment to examine changes in intron diversity, size and location in animal genomes. I am always a bit frustrated with the way introns are treated in many genome characterisation papers: “the genome contained Y introns with mean intron size X bp” is usually all we get. This sort of summary stat can hide all manner of interesting trends. One measure that is often useful is intron density, but unfortunately there doesn’t seem to be any standardised way to use this measure. Density is often measured as ‘introns per gene’, which is a reasonable shorthand, although since genes vary in length very considerably both within and between genomes it makes quantitative analysis very difficult indeed. I have seen ‘introns per 10kb’! This is OK, but what number to choose? What if each study chooses a different number? ‘Introns per nucleotide’ will standardise this better, and although the number will be very small, we seem to manage just fine with small mutation rate numbers and the like. But the more I think about it the less simple this seems.

Introns per bond

Something that is often overlooked when calculating the number of introns per nucleotide is that introns do not insert into nucleotides but rather the phosphodiester bonds between them. I would suggest therefore that the most accurate and effective way to specify density would be introns per bond. It seems reasonable that counting nucleotides is a convenient shorthand for this, but actually this shorthand leads to small but persistent errors. This is an unfortunate consequence of genome annotation restricting itself to nucleotides but genomic processes sometimes targeting bonds.

In the cartoon of a gene above, CDS represents the protein-coding region and UTR stands for the 5′ and 3′ untranslated regions. The dashes between nucleotides represent phosphodiester bonds joining the nucleotides. There are 6 nucleotides in the 5′-UTR, 9 nucleotides in the CDS and 5 nucleotides in the 3′-UTR. What would happen if we were to insert an intron in this sequence at the boundary of one of the gene regions? In which gene region would it be counted?

Coding regions almost always begin with a codon specifying a methionine residue: the start codon ATG. Nothing preceding this A nucleotide is counted as part of the CDS. Coding regions finish with a termination codon, TGA in the example above. This A nucleotide is the end of the CDS. By usual practice, therefore, any intron inserting into the bond between the T and the A at the 5′ end of the CDS would not be counted as part of the CDS, nor would any intron inserting into the bond between the A and the A at the 3′ end of the CDS. This is quite reasonable in many ways, but defining UTRs by reference to the CDS (after the last nucleotide, before the first nucleotide) means that the CDS ends up with one fewer bond than it has nucleotides, whereas each UTR has exactly as many bonds as nucleotides! The 5′-UTR here has 6 nucleotides and 6 bonds, the 3′-UTR has 5 nucleotides and 5 bonds, but the CDS has 9 nucleotides and only 8 bonds where an inserting intron would be labelled as a ‘CDS intron’.
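This bookkeeping is easy to get wrong in a counting script, so here is a sketch that makes the convention explicit for the toy gene above (6 nt 5′-UTR, 9 nt CDS, 5 nt 3′-UTR). The three-region layout and the rule that boundary bonds are credited to the UTRs both come from the text; the function only handles this simple case.

```python
# Sketch: nucleotides vs. insertable phosphodiester bonds per gene region,
# for the toy gene in the text (6 nt 5'-UTR, 9 nt CDS, 5 nt 3'-UTR).
# Convention from the text: bonds flanking the CDS are credited to the UTRs.
REGIONS = [("5'-UTR", 6), ("CDS", 9), ("3'-UTR", 5)]

def bonds(name, n_nucleotides):
    """Bonds where an inserting intron would be labelled a '<name> intron'."""
    if name == "CDS":
        # internal bonds only: one fewer than the nucleotide count
        return n_nucleotides - 1
    # each UTR keeps its internal bonds plus the boundary bond it shares
    # with the CDS, so bonds equal nucleotides
    return n_nucleotides

counts = {name: (n, bonds(name, n)) for name, n in REGIONS}

def intron_density(n_introns, n_bonds):
    """Introns per bond: the density measure argued for above."""
    return n_introns / n_bonds

if __name__ == "__main__":
    for name, (nt, b) in counts.items():
        print(f"{name}: {nt} nucleotides, {b} bonds")
```

Note the sanity check that falls out of the convention: the per-region bond counts sum to the total nucleotide count minus one, exactly the number of internal bonds in the whole transcript.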

Does this matter?

Both yes and no. Counts using introns/nucleotide will be very similar to introns/bond. I am not claiming that work needs to be repeated or that substantial errors call previous work into question. But there are two issues here:

  1. We should do it right. Understanding the actual insertion process requires us to use the right language. We should label introns as inserting between nucleotides to avoid confusion. You may not be confused, but try writing a script that counts introns when everything is labelled by nucleotide position.
  2. We can’t yet be sure what difference correct counts make in large data sets. The age of genomics is here. We can study hundreds of thousands of introns from lots of species and treat this mass of data statistically. The numbers of introns/nucleotide and introns/bond may look similar to our eyes, but trusting our savannah ape brains to make the right call is a risky strategy with big numbers.
The lab now uses these ‘per bond’ counts in our genomic intron scripts, which will be released when the first paper is out. I think it would be great if there were a biological standard for intron density; maybe we should even give it a unit. A Gilbert, perhaps, could equal one intron per 10³ bonds?
This post may be cited using the DOI: 

Jul 22 2011

For those of you who haven’t come across it before Bio-Linux is an operating system set up for bioinformatics with a huge number of programs pre-installed. It can be obtained (for free) from the NERC Environmental Bioinformatics Centre. I’ve spent quite a while recently messing with installations of software packages and wanted to see how everything would work in a pre-installed environment. You can obtain a USB drive from NERC and boot from that, but it doesn’t work for OSX. Also, I wasn’t sure that I wanted to reboot each time as I may need to flip backwards and forwards between applications in Bio-Linux and OSX. Here I document a few experiments with installing and running Bio-Linux within OSX (so I don’t have to re-boot) using VirtualBox.

Here are a few choice quotes about Bio-Linux

Bio-Linux 6 packs a wealth of bioinformatics tools, scientific software and documentation into a powerful and user-friendly 64-bit Ubuntu Linux system. Download Bio-Linux today and turn your PC into a powerful workstation in minutes.

Bio-Linux 6.0 is a fully featured, powerful, configurable and easy to maintain bioinformatics workstation. Bio-Linux provides more than 500 bioinformatics programs on an Ubuntu Linux 10.04 base. There is a graphical menu for bioinformatics programs, as well as easy access to the Bio-Linux bioinformatics documentation system and sample data useful for testing programs. You can also install Bio-Linux packages to handle new generation sequence data types.

FYI: I’m running OSX 10.6.8 (Snow Leopard) on a MacPro with 4GB RAM and 2x 2.8GHz Quad-Core Intel Xeon processors. The list below is going to take >1 hour.

Here’s what I did to install

  1. Download and install VirtualBox from the VirtualBox website.
  2. Download Bio-Linux 6 (2.2 GB) from the NERC Environmental Bioinformatics Centre site. Since this is free, supported software paid for by the UK taxpayer, it would be really great for NBAF-W if you registered, so that they can say ‘X people have downloaded this software’. Also please cite the paper (Field et al 2006) when you can.
  3. Open VirtualBox and click “New” from the toolbar. Follow the installation Wizard.
  4. Give your virtual machine a name like “BioLinux”, choose Linux as the operating system, and select Ubuntu 64 bit as the version.
  5. Select the amount of RAM to give it: 1024MB should be OK; the default of 512MB could be a bit mean. More RAM is always better, especially if you are going to set it to do a lot of hard work. This can always be changed later.
  6. Virtual Hard Disk- use the defaults (create new), and again on the next screen (VDI).
  7. Virtual disk storage details. “Dynamically allocated” is the default and I used this first time out. I suspect that it was the cause of slowness though and changed to “Fixed size” next time through. Certainly if you go for Dynamically allocated make sure to give it enough space on the following screen.
  8. VD file location and size: I used 8GB and Dynamic first time through, and it was immediately short of space after I did a system update. I would definitely choose 16GB if you have the space on your HD. When I compared the two, the 16GB fixed-size disk felt much faster.
  9. The next screen is a summary and now you can press “Create” to create your virtual disk. If you have chosen “Fixed size” it will take a little while to create this virtual disk (5-10 mins) but will likely run faster in the future. At the end of the process you come back to exactly the same summary screen as at the start, with no indication that anything has happened. If you press the “Create” button again though it immediately updates to show you your new virtual disk in the VirtualBox Manager window.
  10. You can now press the Green Start arrow in the toolbar to launch it. You will now get a “First Run Wizard”.
  11. Select Installation Media. Now is the time to select the operating system that you specified in step 4, ie point it towards your download of BioLinux. If you click on the little folder icon to the right of the drop-down menu you can select your BioLinux file. Use the dropdown in the file list window to select “RAW (*.iso *.cdr)” as your BioLinux is an .iso file. Check your downloads folder to locate it. At this point it is very easy (I did it 4 times across 2 installs) to click on something that causes the screen to freeze and bleep whenever you click on anything. The Esc key solved this for me. Be careful where you click! When you have selected the file you should be back at the Select Installation Media dialog with bio-linux-6-latest.iso now selected. Continue.
  12. The next screen claims that you are installing your file from CD/DVD, ignore that, you know the truth. Click Start.
  13. You should now get an Ubuntu window and wait a couple of minutes before it boots and you see the BioLinux desktop and the install window.
  14. Choose your language and “Install Bio-Linux 6” at the bottom. Don’t click on “Try Bio-Linux”. Then select your time zone.
  15. Keyboard layout “Choose your own” then select “United Kingdom Macintosh” from the right panel.
  16. Accept the defaults, then add your name and password. I set it to log in automatically here.
  17. Now click INSTALL. Almost done. It will take a few minutes to install, go and have a coffee.
  18. “Installation complete- you need to restart the computer.” This refers only to the virtual computer. Restart. “Please remove the disc and close the tray (if any) then press enter”. This is because the software still thinks you are installing from a DVD. Ignore it, and press enter, Ubuntu Bio-Linux will boot.
Congratulations, all done, you are ready to go off and play with it.
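For the command-line inclined, the GUI steps above correspond roughly to a handful of VBoxManage calls (VirtualBox’s CLI). This is a dry-run sketch: the commands are echoed rather than executed, and the flags are from memory, so check `VBoxManage --help` for your VirtualBox version before running anything for real.

```shell
# Dry-run sketch: the GUI steps above, approximated as VBoxManage commands.
# Commands are echoed, not run; the flags are assumptions from memory.
VM=BioLinux
LOG=""
run() { echo "$@"; LOG="$LOG $*"; }

run VBoxManage createvm --name "$VM" --ostype Ubuntu_64 --register        # steps 3-4
run VBoxManage modifyvm "$VM" --memory 1024                               # step 5
run VBoxManage createhd --filename "$VM.vdi" --size 16384 --variant Fixed # steps 6-9
run VBoxManage storagectl "$VM" --name IDE --add ide
run VBoxManage storageattach "$VM" --storagectl IDE --port 0 --device 0 \
    --type dvddrive --medium bio-linux-6-latest.iso                       # step 11
run VBoxManage startvm "$VM"                                             # step 10
```

Scripting the VM creation this way also makes the setup itself reproducible, in the same spirit as the rest of this blog.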
There might be a few things you want to do in this new operating system.
  • You should probably set a network proxy: System –> Preferences –> Network proxy. Similarly you might want to use the “Ignored hosts” tab to exclude your university domain “*” in my case
  • You might want to update the system software. System –> Administration –> Update manager.
  • You might want to go to the VirtualBox Manager window  and click on Shared Folders. Then add a folder from your HD where you want to keep data accessible to both operating systems. I set mine to Auto-mount when I log in. I don’t think this works until you have restarted Bio-Linux.


You may also find a preconfigured VirtualBox BioLinux image, but at the time I wrote this it wasn’t the latest version (v5). It might be worth checking.
Many thanks to Steve Moss who introduced me to VirtualBox, helped me install this, and showed me some useful stuff.
Vested interest? I am on the NERC Biomolecular Analysis Facility (NBAF) steering committee, which has a role in oversight of NBAF-W who created Bio-Linux. I don’t feel in any way biased by this, but hey, you decide.
Jun 25 2008

I’m not a very competent perl programmer. Even writing the word programmer here makes me slightly embarrassed. I do carry out frequent sequence conversions and manipulations with perl scripts I’ve put together, though. Sometimes when I need to run a script many times, I’ve found the most irritating thing is launching the scripts and pointing them towards the right input file. A much simpler option in this case is to save the script as an application and drop the files onto it to carry out the conversion. I’ve come across two options for doing this (all this is very Mac-centric I’m afraid, but I’d be interested to see MS equivalents in the comments).
The first is the open-source program Platypus by Sveinbjorn Thordarson, which “can be used to create native, flawlessly integrated Mac OS X applications from interpreted scripts such as shell scripts or Perl and Python programs”. Make sure that the “is droppable” check box is selected. I found it quite straightforward to turn scripts into droppable applications this way. As it says on the site (though it needs some remembering), you will need to modify your script slightly to accept the infile correctly. The basic tutorial page says the following:

Enabling “Is droppable” for an app will modify the property list for the app in question so that it can receive dropped files in the Dock and Finder. These files are then passed on to the script as arguments via @ARGV. However, the first argument to the script ($ARGV[1], $1 etc., depending on your scripting language of choice) is always the path to the application bundle (for example “/Applications/”).

Essentially this means that (in perl at least) where your input file would be identified right at the start by $ARGV[0], it should be changed to $ARGV[1] before creating your application.
Another interesting aspect is the ability to bundle in code files referred to in your script. This means, for example, that if you have a script that depends on BioPerl, it needn’t break; just add in the path to the parts of BioPerl needed.

The second option is an AppleScript droppable application. I have to admit that I have never written an AppleScript, but I came across a post recently from TUAW outlining the “do script” command. AppleScripts can be saved as droplet applications onto which you drop input files. A bit of Googling reveals people using both do script and do shell script. The latter seems a bit odd, since the script being run is perl rather than shell, but it looks like either will work.
As an example I once created a perl script that took an alignment in a range of formats and converted to a format acceptable to phyml, then ran the program using standard settings of my choice. I have this on my desktop as a droppable application called “runPhyml”. It works very nicely for generating quick trees.

Mar 01 2008

In order to really get information out of building phylogenetic trees (especially large ones) some thought has to be given to how to annotate the tips (OTUs).
The two programs that seem to do this in a powerful way are ARB and Treedyn. I also want to explore Tree-Q vista, which looks promising, but I haven’t really had a chance yet. (Has anybody got experience with Tree-Q vista?)

Treedyn is a very good program for editing and annotating phylogenetic trees. Its action can be driven by scripts and it can carry out many sophisticated graphical transformations.

“Many powerful tree editors are now available, but existing tree visualisation tools make little use of meta-information related to the entities under study such as taxonomic descriptions, geographic distribution or gene functions. This meta-information is useful for the analyses of trees and their publications, but can hardly be encoded within the tree itself (the so-called newick format). Consequently, a tedious manual analysis and post-processing of the tree’s images is required. Particularly with large trees, multiple trees and multiple meta-information variables. TreeDyn links unique leaf labels to lists of variables/values pairs of annotations (meta-information), independently of the tree topologies, remaining fully compatible with the basic newick format.”

What information can it be labeled with? The best thing would be to parse the information out of the original GenBank files of the sequences that created the tree. Treedyn allows conditional annotation of OTUs by adding to or replacing the existing names. This can be done from an annotation file where the information is held as “key{value}” pairs, such as accession_number{AY123456}, on a line following the unique name from the newick file.

I wrote a little perl script to do this. This could be done much better using BioPerl. My perl skills are very basic, but it works.

#!/usr/bin/perl

# Creates an annotation file for treedyn from a file containing multiple
# GenBank records. Annotations are of the form key{value}. Keys must not
# contain spaces.

# usage: > outfile.tlf

$/ = "//"; # break up records on the GenBank // delimiter
while (<>) {
    /ACCESSION[ ]*(\S+)/; # matches the ACCESSION line
    $accession = $1;
    /AUTHORS[ ]*(\w+),/; # matches the first author surname
    $author = $1;
    /organism="(\w+)[ ]+(\w+)/; # matches genus and species
    $genus = $1;
    $species = $2;
    /isolate[ ]*(\S+)/; # matches the isolate line
    $isolate = $1;

    print "$accession \tgenus {$genus} \tspecies {$species} \taccession {$accession} \tisolate {$isolate} \tauthor {$author}\n";
}
In addition to tip names Treedyn is able to annotate OTUs with graphical character data, some nice examples on the website.

Of course I also have some grumbles about Treedyn. It doesn’t work properly on Macs, and never has. The PC version, though, seems very stable. The interface is an absolute nightmare: one of the most illogical and confusing I have ever seen. But you can learn to survive it with a little patience. Despite all this, the actual functions are well thought out and powerful, even if applying them is difficult sometimes.

The best thing about Treedyn in my opinion is that it is open source.