Nov 09 2014
 

In part 1 of this series I wrote about why we need reproducible phylogenetics; here I write about what we actually need to do.

tl;dr

We need only a few classes of things (open reusable archiving of all data, information provenance, recording of data treatments & software environments), to make our work reproducible. Many of the necessary approaches are already routinely used in computational sciences, but rarely in phylogenetics, which is a shame.

Reproducibility is something that needs to be considered from the start of your work, in the same way that you would incorporate controls from the very start of your experimental design for a wet lab experiment. Perhaps make a written reproducibility plan alongside, and integrated with, your experimental design?

What do we need to make phylogenetics reproducible?

I think the only things we really need are: data provenance, recording of data treatments, archiving of software with recorded versions and settings, and public archiving of all data and outputs.

This is a surprisingly short list, yet I think that an experiment implemented this way could be fully reproducible. I have seen other lists for descriptions of phylogenetic experiments (Leebens-Mack et al. 2006), and for data sharing (Cranston et al. 2014), but I think that although these types of lists are OK, sometimes excellent, they don’t reflect a mature discipline that has learned adequately from the computational sciences.

Open reusable data archiving

Recently this has become pretty easy. Journals accept enormous amounts of supplementary data. We have repositories like the excellent DataDryad and FigShare. We also have GitHub which, although not a repository per se, can make quite a good one. Can you link to your data? Is it permanently archived in replicated repositories? Is it easy to use? My personal preference is FigShare, but you should investigate for yourself if you’re not routinely placing your research data somewhere sensible.

Provenance

What if we had a computational environment where, whenever we did something to a data file to generate an output (e.g. build a sequence alignment, construct a phylogeny), exactly which data had generated which output was recorded for us accurately, in the background, without need for our intervention or a complex file-naming scheme? This is data provenance: the management and tracking of information, recording which file provided information for which output. These tools already exist and are used in other fields, but not usually in phylogenetics.
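As a minimal sketch of what such a provenance record could look like, the Python below links input files to the outputs they generated by content hash rather than by file name. The function names, the JSON-lines log format, and the default file name `provenance.jsonl` are assumptions made for illustration; this is not how any particular provenance tool works.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_provenance(inputs, outputs, step, log_path="provenance.jsonl"):
    """Append one record linking input files to the outputs they produced.

    Files are identified by content hash, so the record still identifies
    the right data even if a file is later renamed. (Hypothetical helper,
    not part of any existing tool.)
    """
    record = {
        "step": step,
        "time": datetime.now(timezone.utc).isoformat(),
        "inputs": {str(p): sha256_of(p) for p in inputs},
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

Calling `record_provenance(["seqs.fasta"], ["alignment.fasta"], "align")` after an alignment step would append one line to the log. A real provenance system would do this automatically for every step rather than relying on you to call it.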

Recording data treatments

What if we had a computer environment that tracked not only all data files but also every change made to those data files (even if you keep the same file name)? This would allow you to return to any version of a file made at any point in time, and your experimental record would specify exactly which file version was used. This is the essence of version control, a truly powerful tool for managing data: ubiquitous in the computational sciences, yet one most biologists have barely heard of.

You could read Version control for scientific research, or Git for Scientists: A Tutorial, or Try Git. Git is a type of version control system.
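To make the core idea concrete, here is a toy Python sketch that keeps every version of a file under an identifier derived from its contents. It is emphatically not Git and not a substitute for it; the `snapshot` and `restore` functions and the `snapshots` directory are hypothetical names used only to illustrate how a record can point at an exact file version even when the file name never changes.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot(path, store="snapshots"):
    """Save the file's current contents in the store, named by their hash.

    Returns the version identifier (a SHA-256 digest). Identical contents
    always give the same identifier, so nothing is stored twice.
    """
    data = Path(path).read_bytes()
    version = hashlib.sha256(data).hexdigest()
    store_dir = Path(store)
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / version
    if not target.exists():
        target.write_bytes(data)
    return version


def restore(version, path, store="snapshots"):
    """Overwrite `path` with the contents recorded under `version`."""
    shutil.copyfile(Path(store) / version, path)
```

Writing the returned identifier into your experimental record is what lets you say exactly which version of `alignment.fasta` a tree was built from. A real version control system adds history, authorship, branching, and much more on top of this idea.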

Recording data treatments can be done manually, writing down each parameter for each analysis program, but what if absolutely all steps for all programs were automatically recorded without you needing to do anything? This will happen by default if your analysis is scripted, i.e. carried out in a computer pipeline. A computer pipeline is so called because it carries information the way a physical pipeline carries water, connecting different places. Pipelining the flow of data between different analysis programs is seen as normal practice in many numerical disciplines, and while the use of scripts is increasing in phylogenetics, pipelines are still too rare.
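A minimal sketch of why scripting gives you this record for free: in the Python below, the list of steps that drives the analysis is also the list that gets written to the log, so the two cannot drift apart. The runner and the two toy steps (`drop_short`, `uppercase`) are hypothetical stand-ins for real programs such as an aligner or a trimming tool.

```python
import json


def run_pipeline(data, steps, log_path=None):
    """Run each (function, params) step in order and record what was done.

    Returns the final data and the list of log entries; if log_path is
    given, the log is also written there as JSON.
    """
    log = []
    for func, params in steps:
        data = func(data, **params)
        log.append({"step": func.__name__, "params": params})
    if log_path is not None:
        with open(log_path, "w", encoding="utf-8") as fh:
            json.dump(log, fh, indent=2)
    return data, log


# Hypothetical toy steps standing in for real analysis programs.
def drop_short(seqs, min_length):
    """Keep only sequences that are at least min_length long."""
    return {name: s for name, s in seqs.items() if len(s) >= min_length}


def uppercase(seqs):
    """Normalise all sequences to upper case."""
    return {name: s.upper() for name, s in seqs.items()}
```

Running `run_pipeline(seqs, [(drop_short, {"min_length": 4}), (uppercase, {})])` returns both the processed sequences and a log naming each step with its exact parameters, in the order they were applied.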

Recording of software environments

What if we had a computer environment where we could just press “save” and everything would be saved? Yes, everything. We would not be expected to make a full list of all the software used, with their versions. We would not have to record our operating system, and we would not have to find all the dependencies needed for programs and scripts to run properly. Our operating system, including all the programs we had used at every stage of the analysis, would be archived with a single command. This whole environment could then be archived as an experimental record, with a DOI, allowing other scientists to download and open up this ‘image’ of our machine and carry on where we left off. These “virtual machine images” exist, and are commonly used, but not often in phylogenetics.
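A virtual machine image captures far more than any script can, but a much weaker first step is to write a manifest of the environment alongside your results. The Python sketch below, using only the standard library, records the operating system, the Python version, and the installed Python packages. The function name and the shape of the manifest are assumptions for illustration, and it says nothing about non-Python programs such as aligners or tree builders.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Collect OS, Python, and installed-package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the manifest so it can be saved next to the analysis outputs.
    print(json.dumps(describe_environment(), indent=2))
```

A manifest like this tells a reader what you ran, but unlike a machine image it does not let them simply open your environment and carry on.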

Conclusions

I am suggesting that most of the issues surrounding reproducible phylogenetics are solved problems in other disciplines. The things that are still challenging are not about achieving reproducibility but about achieving it easily, irrespective of computational experience, such that reproducibility becomes the default behaviour.

Next I write about some problems and the importance of reusability in addition to reproducibility.

 

Cranston K, Harmon LJ, O’Leary MA, Lisle C: Best practices for data sharing in phylogenetic research. PLoS Curr 2014, 6.

Leebens-Mack J, Vision T, Brenner E, Bowers JE, Cannon S, Clement MJ, Cunningham CW, dePamphilis C, deSalle R, Doyle JJ, Eisen JA, Gu X, Harshman J, Jansen RK, Kellogg EA, Koonin EV, Mishler BD, Philippe H, Pires JC, Qiu Y-L, Rhee SY, Sjölander K, Soltis DE, Soltis PS, Stevenson DW, Wall K, Warnow T, Zmasek C: Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA). OMICS 2006, 10:231–237.

Oct 17 2014
 

We are still largely missing the benefits of reproducibility in phylogenetics. I think that this makes our lives unnecessarily difficult and makes us particularly poorly prepared to confront modern data-rich phylogenetics. In this first post, “Why”, I want to talk about why we need reproducible phylogenetics. Then, in part two, “What”, I’m going to talk about some possible approaches to reproducible phylogenetics. In part three, “How”, I’m going to look at some existing software solutions. Lastly, in “ReproPhylo”, I’m going to write about the work my lab is doing to bring these approaches together to create new reproducible phylogenetics solutions.

tl;dr: Reproducibility helps us do better phylogenetics and do it more easily. There are a number of partial solutions out there. We introduce the ReproPhylo framework for easier + better reproducible experimental phylogenetics.

Why do we need reproducible phylogenetics?

Three short answers, then I explain below.

Answer 1: to do better quality science. This is achieved by being able to build on and extend other people’s work. It is also achieved by being able to take an experimental approach to phylogenetic methodologies.

Answer 2: to make your life much easier. The person most likely to reproduce your work is “future you”. Making it easy to reproduce, and then modify, your analysis will save you lots of time. It will help you to do more actual research and less reformatting of files and coaxing belligerent applications to read them.

Answer 3: If your work isn’t fully reproducible, is it really science? Sure, it’s nice work that clarifies some important issues, you’re a bright person and it’s likely correct, but if it’s not reproducible … what the hell are you thinking? Is this why you got into science? To accept stories from other scientists based on them being bright and it sounding right? You are much more sceptical than that about the science you read, so shouldn’t people also be sceptical of your work? Yeah, exactly.

Standing on the shoulders of giants

The ability to extend the work of others, to stand on their ‘shoulders’ [1] and reach higher, is how progress is made. “Wow, if I just added these species to that tree, used their analytical approach, I could actually test [whatever]”. But can you add species and use that approach? Or do you have to start from scratch, collecting sequences from GenBank and trying to reproduce their work before extending it?

How much do you want your work to have impact going forward? Make it easy for people to extend your work and you will be influential.

‘Experimental’ phylogenetics

This refers to approaches where we test the influence of method, parameter choice, and data inclusion on our tree structure. How many studies have you seen where people explore parameter space exhaustively and explicitly compare the phylogenies produced? Not many, I would guess. Any? The reason is that it is too difficult to experiment when manual approaches to phylogenetics are used. Have you ever experimented with alignment parameters in your MSA program of choice? Most phylogeneticists only run the defaults; you can check the Methods sections of papers to confirm this. If I have to align sequences with 6 different programs, each with 50 different combinations of parameters, and then compare some characteristics of the 300 alignments and resulting trees, this will be a truly mammoth piece of work. If, however, I have an entirely reproducible pipeline that will iterate over parameter space and produce a detailed report with clear summaries of alignment characteristics and tree variability, then this becomes not an exceptional piece of work but just something I would typically do before getting down to detailed analysis of the question at hand. If a reviewer or critic thinks I have chosen the wrong range of parameters to optimise, they can simply add others and hit RUN on my pipeline to compare to my optimum values. The robustness of science improves.
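To show how little code the iteration itself needs, here is a minimal Python sketch that enumerates every program and parameter combination for a sweep like the one described above. The program names and the `gap_open` and `gap_extend` options are hypothetical placeholders, not the real settings of any aligner, and the sketch only plans the runs rather than executing them.

```python
from itertools import product


def parameter_grid(options):
    """Yield every combination of parameter values as a dict."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def plan_runs(programs, options):
    """List one (program, parameters) job per program and combination."""
    return [(prog, params)
            for prog in programs
            for params in parameter_grid(options)]


# Hypothetical example: 6 programs, 10 x 5 = 50 combinations each.
programs = [f"aligner_{i}" for i in range(1, 7)]
options = {
    "gap_open": list(range(1, 11)),           # 10 illustrative values
    "gap_extend": [0.1, 0.2, 0.5, 1.0, 2.0],  # 5 illustrative values
}
jobs = plan_runs(programs, options)
print(len(jobs))  # 300
```

A reviewer who wants a wider range only has to add values to `options` and rerun; the job list, and everything downstream of it, follows automatically.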

Who will actually reproduce my work?

It could be anyone. What if Professor Big loves your work and wants to extend it? Great! But the reality is that the person who will certainly need to reproduce and extend your work is Future You! Make life easy for yourself by starting out reproducibly; anyone else calling you a giant and wanting to stand on your shoulders is a bonus.

What makes you think phylogenetics isn’t pretty much OK now?

Long and painful experience makes me think that. Try reproducing three phylogenetics results from papers and your perspective will change. Can you get their data easily, or is it a list of GenBank identifiers in a supplementary Word table that you then have to type into the NCBI website? Can you run their software? If it’s an old paper, can you even find their software? Do you know the dependencies needed to run it? Do you know what version they ran? Do you know the parameters they used? Are the default parameters now the same as then? Did they record data transformations exactly? Maybe they changed sequence names between the original files and the tree figure. Maybe some taxa were excluded as problematic. Maybe the alignment was manually improved. Maybe some alignment regions were masked. All that is fine, but do you know exactly what they did and how? Did they archive the final tree or only a picture of it? A picture would only allow you to compare by visual inspection to see whether you have reproduced a previous study. It is estimated that >60% of phylogenetics studies are ‘lost to science’ [2]. This is a problem.

What is Reproducibility?

I’m not going to cover the semantic differences between reproducibility, replication, repeatability etc. Here I take a practical view of reproducibility as a term used routinely to represent the above terms. I really like this video of Carole Goble explaining the concepts of reproducible research.

Reproducibility is the correct way to do science.

Reproducibility is so integral to what we consider the scientific process that it is hard even to make a counter case here, so I won’t really try. So why isn’t reproducibility the norm? Well, a technically poor form of reproducibility is the norm: the Methods section of the journal article. Later in this series I suggest that technical challenges have prevented complete and efficient reproducibility in the past. It hasn’t been your fault, but now those challenges are pretty much solved (part 3, “How”) and we should grasp and benefit from these new possibilities.

[1] Standing on the shoulders of giants. Wikipedia. Available: http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants.

[2] Magee AF, May MR, Moore BR: The Dawn of Open Access to Phylogenetic Data. arXiv [q-bioPE] 2014. http://arxiv.org/abs/1405.6623