Aug 27 2012

I’ve been reading a lot recently about reproducible research (RR) in bioinformatics on several blogs, Google+ and Twitter. The idea is that someone should easily be able to reproduce* your results (and even figures) from your publication using the code and data you provide. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research: Reproducible Research in Phylogenetics, or RRphylo.

The current state of affairs

The problem is that although methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, doing so would be a very time-consuming process: lots of hands-on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looked like the one published, and that’s not really good enough.

Why it matters

I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described in his blog. I have had experience of this too, but that’s another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important, you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork, this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.

What should be done?

Briefly: pipelines, workflows and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, Biopython) or DendroPy, or Hal. This would go from sequence download, manipulation and alignment (though this could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step, and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than an alignment file.
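To make the idea concrete, here is a minimal sketch of the shape such a script could take. It is not taken from any of the toolkits named above: the step functions are placeholders that only return descriptive strings, and the step names, the GTR+G model and the 100 bootstrap replicates are illustrative assumptions of mine. The point is only that every setting lives in one explicit structure and that the script records what it ran with.

```python
import json

# Every setting is written down explicitly, so nothing depends on a
# program's silent defaults. All names and values here are illustrative.
PARAMS = {
    "retrieve": {
        "database": "nucleotide",
        "query": "cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]",
    },
    "align": {"program": "MUSCLE"},
    "tree": {"model": "GTR+G", "bootstrap_replicates": 100},
}


def retrieve(_previous, p):
    # Placeholder: a real script would query the sequence database here.
    return f"sequences matching {p['query']}"


def align(sequences, p):
    # Placeholder: a real script would run the alignment program here.
    return f"alignment[{p['program']}] of {sequences}"


def build_tree(alignment, p):
    # Placeholder: a real script would run the tree-building program here.
    return (
        f"tree[{p['model']}, {p['bootstrap_replicates']} bootstraps] "
        f"from {alignment}"
    )


# The order of analysis is itself part of the record.
STEPS = [("retrieve", retrieve), ("align", align), ("tree", build_tree)]


def run_pipeline(params):
    """Run each step in order, feeding each output to the next step,
    and log the parameters every step was given."""
    result, record = None, []
    for name, step in STEPS:
        result = step(result, params[name])
        record.append({"step": name, "parameters": params[name]})
    return result, record


result, record = run_pipeline(PARAMS)
print(json.dumps(record, indent=2))
```

Swapping the placeholders for real calls would not change the structure: adding a gene, changing the model or rerunning with different support settings becomes an edit to PARAMS followed by a rerun.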

Phylogeneticists are sometimes much less comfortable with scripting than are bioinformaticians, but this is something that has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).

There is a certain attraction to GUI approaches, though, and GUI workflow builders may well become more common. The figure above is from Armadillo, which looks interesting, but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.

At the moment, a script-based approach that links analyses together is the best option for RRphylo.

‘So you’ve been doing this, right?’

Err no, I’ve been bad, really I have. Some of my publications are very poor from the point of view of RRphylo. The reasons for that would take too long to go into. I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need and the challenges in implementation. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of analysis is how good phylogenetics is done, and that is best implemented in an automated way.

Automated Methods sections for manuscripts?

One interesting idea, both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from GenBank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe this itself in a human-readable text format suitable for your manuscript.

The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from the NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……

This is a brief, made-up Methods section that doesn’t contain the detail it could. Parameter values from the script have been inserted in blue. This sort of output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB: it is not this methods text that needs to be distributed. The script that carried out the analyses (and produced this human-readable summary as an extra output) is the real record.
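As a rough sketch of how a pipeline might write that draft, the snippet below fills a sentence template from a record of what was run. The values are the made-up ones from the example above. The record's field names and the methods_text function are my own hypothetical choices, and the sketch stops where the example trails off rather than inventing alignment settings.

```python
def methods_text(rec):
    """Build a first-draft Methods paragraph from a record of the run."""
    return (
        f"The full Methods pipeline is archived at doi:{rec['archive_doi']} "
        "and is described briefly as follows: sequences were downloaded "
        f"from the NCBI {rec['database']} database ({rec['retrieved_on']}) "
        f"with the Entrez query {rec['query']}. "
        # The ',' format spec adds the thousands separator.
        f"These {rec['n_sequences']:,} sequences were aligned with "
        f"{rec['aligner']} v{rec['aligner_version']}."
    )


# Made-up values from the example Methods text; a real pipeline would
# fill this record in as each step ran.
record = {
    "archive_doi": "12345678",
    "database": "nucleotide",
    "retrieved_on": "27 August 2012",
    "query": "cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]",
    "n_sequences": 5611,
    "aligner": "MUSCLE",
    "aligner_version": "1.2.3",
}

print(methods_text(record))
```

Because the text is generated from the same record the analysis used, the written Methods cannot drift away from what was actually run.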

Implementation

This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old methods section they always have: a general description of what they did that people in the field will understand. I went today to read a lot of Methods sections from phylogeny papers. Some were better than others in describing the important details, but none sounded relevant to the new era of large-scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.

It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.

The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing the easy GUI linking of phylogenetic modules and publication of standard-format workflows to MyExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user guides allowing even moderately computer-literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other places for workflows that are worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing, so I’d be really interested in other suggestions and views on these options.

I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible; of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?

See also 

Our approach to replication in computational science, C Titus Brown

Reproducible Research: A Bioinformatics Case Study Robert Gentleman 2004

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nekrutenko & Taylor Nature Reviews Genetics 13, 667-672 (September 2012) | doi:10.1038/nrg3305 (subscription required)

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goecks et al Genome Biology 2010, 11:R86 doi:10.1186/gb-2010-11-8-r86 (open access)

Reproducible research with bein, BiocodersHub

Sumatra also talks about pipelines acting as an Electronic Lab Book, here’s a presentation about it.

* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.

Mar 09 2008

While I was reading the Nature paper I was talking about in my last post, I was thinking about the use of the term “phylogenomics”. It seems like there are two quite separate contexts where it is used:
(1) integrating evolutionary biology into genomics
(2) phylogenetics using a lot of data

The term “phylogenomics” was first published in 1998 by Jonathan Eisen. It is clear he was talking about option 1, not 2. In a recent blog post he says:

“‘phylogenomic’ analysis in the way I think of phylogenomics — that is — a integration of evolutionary and genomic analyses. (NOTE – I think it is kind of lame that people use the term phylogenomics, which I coined by the way, to refer to “using genomes to infer evolutionary trees).”

So that’s clear then? Well, unfortunately not. There are so many publications now using it in the second context that, irrespective of the initial meaning, it looks like it now has both. The Wikipedia entry (a stub) also includes both meanings.

Web of Science reveals 92 publications in 2007 with the term phylogenomic*. I didn’t go through them and work out the split but both types of phylogenomics are there in numbers. WoS lists 272 publications with “phylogenomic*” across all years (searched 9th March 2008).

In my short description above I used “lots of data” rather than whole genomes. There are relatively few type 2 papers that actually use whole genomes. That’s understandable I guess, especially for Metazoa. The Nature paper I was talking about has a data matrix of 150 genes, although mean data representation for any taxon is only ~50%. That’s still a huge amount of data, but it’s not whole-genome analysis. I wonder what the smallest dataset is that self-applies the term phylogenomics?

I am not exactly clear on the difference between type 2 phylogenomics and supermatrix approaches to phylogenetics. How big does a matrix have to be before it’s super? Ultimately the use of the same name for different things may not matter much. I see little evidence people are very confused.

Here are some papers of interest:

Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis.
Eisen JA
Genome Res. 1998 Mar;8(3):163-7

Phylogenomics: Intersection of Evolution and Genomics
Jonathan A. Eisen and Claire M. Fraser
Science 13 June 2003 Vol. 300. no. 5626, pp. 1706 – 1707
DOI: 10.1126/science.1086292

Phylogenomics and the reconstruction of the tree of life
Delsuc F, Brinkmann H, Philippe H
Nature Reviews Genetics 2005 May;6(5):361-375

The supermatrix approach to systematics
de Queiroz A, Gatesy J
Trends in Ecology & Evolution 2007 Jan;22(1):34-41