Oct 26, 2016

I like Software Carpentry, a non-profit organization teaching basic computing skills to researchers. But if I had one little criticism, it would be that there wasn’t enough carpentry, with tenon saws, and mortising chisels, and rabbet planes. I like actual carpentry. I can’t tell you why; it’s a bit like asking why somebody likes cheese: the answer is ‘yes’.

At work I mess around with software tools to do genomics, bioinformatics, and population genetics. At home I mess around with tools. At work this year I was talking to two postdocs about an approach to a genomics problem and made some quip about the importance of sharpening your axe before starting to fell a tree. Nobody laughed and I decided to pretend that it was insightful wisdom. Then I started to wonder if I could pass off a whole range of rough woodworking folklore as insightful bioinformatics wisdom? So here goes, I’ll prepare my whetstone, sharpen my irons, and let the sawdust fly…

“If you only have 3 hours to chop down a tree, spend the first 2 sharpening your axe”

This one comes in many forms and is often erroneously attributed to Abraham Lincoln. Sometimes there is a lumberjack about to die; don’t worry, he’s fine. For bioinformatics, though, I think the point is that preparation and optimization are the essence of getting things done quickly, rather than hastily launching into the task with a dull axe.

“Measure twice cut once”

This is an old English proverb. I like to think of it as a rationale for unit testing code.

Carpenters and chips

“The best carpenters make the fewest chips” -English proverb, c.1500s

“The carpenter is not the best who makes more chips than all the rest” – Arthur Guiterman (1871-1943)

The best bioinformaticians write the least code

“When you need it to hold you’d better glue and screw it”

Yep, always glue AND screw. I’m not going to give you the bioinformatics translation for everything, where’s the fun in that?


“Without craftsmanship, inspiration is a mere reed shaken in the wind” Johannes Brahms

“Where you find quality, you will find a craftsman, not a quality-control expert” -Robert Brault

“He who works with his hands is a laborer. He who works with his hands and his head is a craftsman. He who works with his hands and his head and his heart is an artist” – St. Francis of Assisi

“When you’re a carpenter making a beautiful chest of drawers, you’re not going to use a piece of plywood on the back, even though it faces the wall and nobody will ever see it. You’ll know it’s there, so you’re going to use a beautiful piece of wood on the back. For you to sleep well at night, the aesthetic, the quality, has to be carried all the way through.” ― Steve Jobs

I really have trouble justifying the Steve Jobs approach sometimes, but it’s important to consider this, a lot


“One only needs two tools in life: WD-40 to make things go, and duct tape to make them stop.” unknown; attributed to G. Weilacher

What would the bioinformatics equivalents of WD40 and duct tape be?

“A simple [work]bench is like Tuscan pasta soup. You think it will be better if you add more stuff. But getting the basics right is way more important, and the extras won’t make up for a poorly prepared stock.” Adam Cherubini

For me the Tuscan pasta soup stock is unix

“Only those who have the patience to do simple things perfectly ever acquire the skill to do difficult things easily.” Friedrich von Schiller (1759-1805)

This was, I think, the first version of Larry Wall’s Perl rationale: ‘to make the easy things easy, and the hard things possible’

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius – and a lot of courage – to move in the opposite direction.” E.F. Schumacher

A violent analysis, I like that idea

Bad advice, usually

The only really good place to buy lumber is at a store where the lumber has already been cut and attached together in the form of furniture, finished, and put inside boxes. -Dave Barry

No no no, always do it yourself, I’m not even joking, don’t use somebody’s GUI

Woodworking minus patience equals firewood

Law of the Workshop: Any tool, when dropped, will roll to the least accessible corner. -Author Unknown

Law of bioinformatics: Any code, when dropped, will roll your data to the least accessible corner

“Blunt tools are sometimes found of use where sharper instruments would fail” -Charles Dickens, Barnaby Rudge

I think Dickens was talking about perl again

“He says no varnish can hide the grain of the wood; and the more varnish you put on the more the grain will express itself” -Charles Dickens, Great Expectations

Pie charts, clearly

Good workmen never quarrel with their tools. Byron

I’ve quarreled with a LOT of tools. I only included him here because he was Ada Lovelace’s dad

“When the only tool you have is a hammer, you tend to see every problem as a nail” Abraham Maslow

BLAST, always hit it with BLAST, and don’t worry if it bends, you can just whack it down into the wood with a Biopython script, nobody will notice

“The things I make may be for others, but how I make them is for me” – Tony Konovaloff

I like this one

“Have thy tools ready. God will find thee work” -Charles Kingsley

I hope god understands full economic costing

“Remember, a chip on the shoulder is a sure sign of wood higher up” -Brigham Young

“I have little patience with scientists who take a board of wood, look for its thinnest part, and drill a great number of holes where drilling is easy.” ― Albert Einstein

Reviewer 3 accuses me of drilling in thin wood, but they don’t understand the pressure I’m under to publish


For safety is not a gadget but a state of mind. -Eleanor Everet

Check stuff, always check, and back it up, twice, also version control


Some of these, quite a few, I just knew and was able to sagely trot out. The rest I harvested from the web including here http://www.fouroakscrafts.com/quotes-woodworkers-craftsmen/ and here http://www.woodshopics.com/quotes/ and here http://www.popularwoodworking.com/woodworking-blogs/editors-blog/woodworking-quotations-quips-aphorisms-and-more and here http://www.quotegarden.com/woodworking.html

Feb 12, 2016

It annoys me slightly that almost all the pictures of Darwin circulating on #DarwinDay are of some grey-haired old guy. Darwin made many of his important breakthroughs as a young man. I mentioned ‘Darwin the dropout’ to my new class of evolution students this semester. Darwin was hassled by his dad to get a proper job. He tried medicine, but he found it a bit dull. He started to train for the clergy, but it wasn’t for him. Then he decided at short notice to go off travelling. He set off on the Beagle at age 22 (23 maybe?), only a couple of years older than my undergrad class. I think that they liked the idea that Darwin was a bit like them, unsure of what he wanted to do, keen on having some fun. I hope some of them, like Charles, can also do good science.

In the picture he’s a much more responsible 31.

Dec 02, 2015

There is now a second tardigrade genome described in a manuscript (Koutsovoulos et al 2015), only days after the first (Boothby et al 2015, not OA). Koutsovoulos et al, however, strongly suggest that the widely publicised claim of rampant horizontal gene transfer (HGT) is problematic. This alternative genome comes from Mark Blaxter’s group in Edinburgh, and has in fact been publicly available since 2014. I was surprised that the first genome paper didn’t compare against it (I’m pretty sure this is true, but I can’t get access to PNAS content from home to double check).
So I’ve learned a few things reading tardigrade papers, and I’ve been very frustrated with Twitter’s 140 characters. I don’t want to try to do complex ideas on Twitter anymore; that’s not what it’s for, and it works very badly when you need a series of numbered tweets to say anything of substance. What I wanted to talk about was publishing genomes rather than genome assemblies.

Preprints are great

Firstly, I’ve been a big fan of preprints for a while. This situation just confirms to me their central place in modern biology. There is a high-profile paper. Another group thinks that it doesn’t match their own analysis of the same organism. Within a week there is a proper manuscript available to read. It’s not peer reviewed, but otherwise it’s a normal manuscript. I can read and judge it for myself, cite it, share it, tweet about it, and write about it in a blog. Preprints are very valuable; there is no excuse for waiting 6 months for some 20th century paper journal to reluctantly grind out a short note on your concerns. Write what you want to say and get it out there, then work on getting it into a journal if you must.

Late release of data hurts science

Secondly, I have learned that the release of the data behind genome papers can be badly delayed. Twitter was alive with ‘where can I download the Tardigrade genome?’, ‘data isn’t available, how can that be?’ and ‘its on GenBank, just waiting for them to release it’. There was no obvious attempt to withhold anything, but data release has to be planned for differently. Data release must be synchronised, to the minute, with the release of the manuscript. Or release the data earlier: you aren’t going to get scooped by releasing the data a day before the manuscript, and that way you can check everything is in place. In their comparative genomics preprint, Koutsovoulos et al clearly needed more access to central parts of the first tardigrade genome assembly:

the UNC raw reads were not yet available. We were also unable to confirm directly expression of the UNC genome HGT candidates because we did not have genome coordinates for the gene predictions.

These are things I would have thought would come out with the paper, and no later, and that’s a shame for science. I’m not implying deliberate obfuscation, just the normal human condition of not getting your shit together in time, something I suffer from too, though I am fighting hard against it because I think it’s really important when publishing.

Linked data and findable data are important

Thirdly, make use of INSDC (ie GenBank, EMBL, DDBJ) in addition to any other resources. Koutsovoulos et al had made their data open in early 2014. They had annotated it, provided resources for exploring it, and were improving its quality. But as I understand things it wasn’t on the SRA (correct me please), and although their url is memorable (tardigrades.org) maybe it wasn’t found by the UNC team? With hindsight, should they have publicised and linked to their genome more? This is related to Tim Berners-Lee’s fifth star for data sharing, ‘link your data’. That said, I love Badger, the genome resource in which the second genome has been held. It’s a really great environment to explore and find stuff out; I wish it were more commonly used.

The age of genome papers isn’t over

I’ve heard a lot of people repeat ‘the age of single genome papers is over’. It really isn’t. However, it’s clear that comparative genomics is very important, and don’t act surprised, because comparative genomics is just normal biology: we collect data from things, we have a look at what is the same and different, we invent ideas and hypotheses, we examine those and try to make sense of stuff. But we can’t do that without single genomes, and people are reluctant to do genomes they don’t get credit for.

I wish that there was a eukaryotic equivalent to Genome Announcements. That might also help to get new genomes out there earlier, rather than waiting for a traditional story to emerge. On the other hand how is that different from just putting a preprint on BioRxiv with your description of the genome, and getting a doi? In the long run the papers we usually want to read are comparative genomics papers, and authors getting credit for releasing the data early, openly and discoverably should be encouraged however we can.

Exciting days are ahead, not just of detailed comparisons of tardigrade genomes but of comparative genomics more broadly, there are lots of surprises and new knowledge out there. Also I’ll say what a lot of people are thinking: time to reexamine the Bdelloid rotifer genome’s ‘HGT’.

Disclosure: I know most of the Edinburgh crowd well and have published with several of them.

Oct 27, 2015

Is it possible for advertising to evolve without human intervention? Today I was having a great coffee break discussion with Domino Joyce and James Gilbert about student classes and evolution experiments. We moved on to whether it is possible to evolve Twitter clickbait[1]. Why would you even want to? Well, it might make for an interesting student project to understand the conditions controlling evolution in populations. Here’s the idea:

  • Set up a dedicated Twitter account
  • Create a script to tweet headlines and links scraped from any celebrity press or tabloids with sensationalist sounding soundbites.
  • Harvest all tweets from a list of tweetbait accounts. Create a tweetbait corpus.
  • Use the corpus to create synthetic tweets, “evotweets”, and allow these to evolve as described.
  • Tweet 90% standard clickbait and 10% evotweets.
    • The reason for 90% normal tweets is that otherwise people associate the account with nothing relevant when clicked through and stop following or clicking. But 90% junk will hopefully keep them going.
  • Use stats of clicks on evotweets to determine evotweet frequencies in the next generation
  • Implement recombination and possibly mutation on evotweets as they “breed” the next generation.
  • Track evolution and adaptation.
    • Do the numbers of clicks/subscriber change with time? Is this adaptation?
    • Will Kardashian sweep to fixation? Is this population size dependent?
    • Will surrealism or reality of tweet language dominate?
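
The breeding step in the list above could be sketched roughly as follows. This is only a toy model built on assumptions that are not part of the plan: evotweets are lists of words, click counts arrive as a plain list rather than from Twitter, selection is click-proportional sampling, recombination is single-point crossover, and mutation swaps a word for a random word from the corpus. The function name and parameters are made up for illustration.

```python
import random

def next_generation(population, clicks, corpus_words, mutation_rate=0.05, rng=random):
    """Breed a new population of evotweets (each one a list of words).

    clicks[i] is the click count observed for population[i].
    """
    # Add 1 to every count so unclicked tweets keep a small chance of breeding.
    weights = [c + 1 for c in clicks]
    offspring = []
    for _ in range(len(population)):
        # Selection: parents are sampled in proportion to their clicks.
        mum, dad = rng.choices(population, weights=weights, k=2)
        # Single-point recombination: head of one parent, tail of the other.
        cut = rng.randint(0, min(len(mum), len(dad)))
        child = mum[:cut] + dad[cut:]
        # Mutation: occasionally replace a word with one drawn from the corpus.
        child = [rng.choice(corpus_words) if rng.random() < mutation_rate else word
                 for word in child]
        offspring.append(child)
    return offspring
```

Tracking how word frequencies change across repeated calls would be one way to ask the fixation question, and varying the population size or `mutation_rate` gives students the knobs to turn.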
This could be a fun UG student project surely?
How big/varied does the starting corpus have to be? What is the optimum tweet length? Optimum rate of recombination for tweets? Does grammar matter or is it just the collection of words?
Problems. We need a way to get it mainstream with a lot of followers. That could be bought I guess. Need good click stats from Twitter. Some parts of the evolutionary model are still obscure to me. Need not to generate something that will destroy society as we know it.

Autonomous Evolving Advertising

The most frightening outcome of this, and a threat to human sanity, is autonomous evolving advertising, ie advertising that evolves and improves to occupy more of our consciousness and activity.
It wasn’t possible before the social internet age. Now though there is big business in individual and demographic specific advertising as a look at Google will tell you. The other thing that has changed is advertising population size and ‘success’ data feedback. There are a LOT of people on Twitter and Facebook and doing Google searches. They click stuff. Data is recorded. Is it possible that algorithms could be developed that evolve advertising to maximise clicks, or maximise ultimate sales? I’m guessing the advertising-bot wouldn’t care what you bought just if you did from its creator’s store. The advertising could evolve to be purchase agnostic “you look terrible, an embarrassment to your mother, go buy something to fix it at StoreX” Advertising bots would be amoral, how mean and manipulative could they become? Would advert-bots also have to evolve to avoid regulatory bodies? Is this like predator-prey coevolution?
I don’t like this future, though it would generate a lot of careers for evolutionary biologists!

[1] Clickbait if you don’t know is a posting whose goal is to get you to click on a hyperlink to a different page, no matter what. e.g. “10 celebrities as you’ve never seen them before” “Find out if Katy Perry is your soulmate” “Ronaldo: my underwear dilemma”

Jun 23, 2015

This is a guest post by Amir Szitenberg, a postdoc in my lab @EvoHull, describing a phylogenomic investigation using ReproPhylo. Amir used to be a sponge researcher if you can’t tell from the tone below. Despite already knowing ReproPhylo could do all this rapidly and of course reproducibly, I was still both surprised and impressed by the scale and speed of Amir’s re-analyses.

My prejudice about early branching Metazoa

The most recent paper on which clade is sister to all other Metazoa (Whelan et al. 2015), like its predecessor (Moroz et al. 2014), places Ctenophora as the first branching metazoan phylum. I have been out of touch with the topic for about three years (might as well have been an eternity with the rate genomes are being sequenced) but I felt I had to do something about those cheeky party-crashing beasts (spoiler alert, they are still there, having sangrias on the earliest branch of the tree). More than that, I wanted to see how it feels to run a full-scale genomic dataset through ReproPhylo.

Whelan et al. (2015) go to great lengths in controlling for paralogy, long branch attraction, and AA heterogeneity, but the two things I found particularly intriguing were the way they control for evolutionary rate by removing the slowest evolving genes from one of their datasets, and the way they avoid over-representing ribosomal proteins.

They recover a very robust signal placing Ctenophora at the base of Metazoa, and they make the point that older studies, such as that of Philippe et al. (2009), instead find Porifera as the early branching phylum because of the over-representation of the structurally constrained ribosomal proteins. But could it be that ribosomal proteins disagree with other loci because they are slow evolving and therefore more reliable?

Plan of action

Take dataset 12 from Whelan et al. (2015) (most inclusive in species sampling, least inclusive in suspected paralogous loci) and check how the topology changes as I gradually change the range of locus entropy (Shannon, 2001) in the dataset. Also, take the dataset from Philippe et al. (2009), exclude ribosomal proteins, but only use loci that are as slowly evolving as ribosomal proteins. In short, run a finer scale check of the effect of locus evolutionary rate (quantified with Shannon entropy) on the tree topology.
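
For readers who have not met entropy as a proxy for evolutionary rate, here is a minimal sketch of the idea: compute the Shannon entropy of every alignment column and summarise the locus by the median. It is an illustration only, not the code ReproPhylo runs. The log base (2 here), the decision to ignore gap and unknown characters, and both function names are my assumptions.

```python
import math
from collections import Counter
from statistics import median

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps and unknowns."""
    residues = [c for c in column if c not in "-?X"]
    if not residues:
        return 0.0
    total = len(residues)
    freqs = [n / total for n in Counter(residues).values()]
    return sum(-p * math.log2(p) for p in freqs)

def locus_median_entropy(aligned_seqs):
    """Median column entropy of a locus, given equal-length aligned sequences."""
    return median(column_entropy(col) for col in zip(*aligned_seqs))
```

An invariant column scores 0 and the score rises as more residues share a column more evenly, so a low median suggests a slowly evolving locus.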

Why this feels like a reproducibility success

At this point I have to say that although I failed miserably at reshaping the metazoan phylogenetic history to my liking, this little reanalysis was a positive experience. First, for both publications (Whelan and Philippe) the data is accessible and very nicely laid out, including the sequence alignment and data partition information. This is not a trivial thing as often journals allow authors to get away with providing only SRA accession numbers. Want to re-analyse? A year of work is in store for you before you get sequence alignments and you have no guarantee that your alignments are the same as the ones used by the authors. Second, ReproPhylo seemed to just behave and do as it was told. Twenty minutes of setting up without (almost, see next) any custom functions required, and by my next encounter with my laptop I had the nicely annotated trees and box-plots you see here, although the analysis was fairly complex, including multiple forks with parameter and data composition variations.

The soft belly (hence the ‘almost’) was the format in which partition data was provided. We worry a lot about standard formats for sequence data, sequence alignments and trees, but not as much about the way data partitions are described. ReproPhylo can handle partitions if they are separated in advance, or if the info is in PAUP format within a nexus file. However, many other popular formats for data partitions are in use. Whelan et al. provide the relevant section from the log file of FASconCAT (Kück and Meusemann, 2010), a very popular program. Philippe et al. (2009) provide the info in a MrBayes (Ronquist and Huelsenbeck, 2003) style nexus file, with the gene names as a comment line. In both cases, it was straightforward to modify the function that reads a nexus file with PAUP style partitions to accommodate the data at hand. These modifications would be widely applicable, I guess, if included in ReproPhylo, but it did make me think about best practices and all of that.
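
To give a feel for how small such a partition reader can be, here is a sketch that pulls `charset name = start-end;` lines out of a nexus sets block. It is not the ReproPhylo function mentioned above, and it assumes each partition is a single contiguous range; the function name and the regular expression are mine.

```python
import re

# Matches lines such as "charset rpl2 = 1-250;" (case-insensitive).
CHARSET = re.compile(r"charset\s+([^\s=]+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)

def read_charsets(nexus_text):
    """Return {partition_name: (start, end)} with 1-based inclusive coordinates."""
    return {name: (int(start), int(end))
            for name, start, end in CHARSET.findall(nexus_text)}
```

Formats that put the gene name in a comment or a log file need a different pattern, which is exactly the kind of per-paper tweak described above.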


The analysis, phylogenetic methods, and results are available here and include an html report, a Project file, a notebook for each of the datasets, some alignment and tree files, as well as a figures directory for each of the datasets. Just a quick credit is in order for the programs used, which are RAxML (Stamatakis 2014), MAFFT (Katoh et al. 2013) and TrimAl (Capella-Gutiérrez et al. 2009). The 210 loci from Whelan et al. (2015) were sorted by median entropy and then divided into four subsets of 50 loci, and another one of 10, with different median entropy ranges. The 150-locus Philippe et al. (2009) dataset was also sorted by entropy and divided into three subsets of loci with different entropy ranges. An additional Philippe et al. subset included only the 50 slowest evolving, non ribosomal-protein loci. Table 1 describes all of this in an organized manner. I tried some additional subsetting tactics which can be seen in the notebooks but which I will not discuss here.
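
The sorting and subsetting step amounts to very little code. The sketch below assumes the per-locus entropy values are already in a dictionary keyed by locus name; the function name and the dictionary are hypothetical, and the actual notebooks may do this differently.

```python
def entropy_subsets(locus_entropy, size=50):
    """Sort loci from highest to lowest entropy and cut into consecutive subsets."""
    ranked = sorted(locus_entropy, key=locus_entropy.get, reverse=True)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

With 210 loci and the default size this gives four subsets of 50 and a final one of 10, fastest evolving first.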

Table 1. The dataset subsets used for phylogenetic reconstruction

| Dataset | Publication | Loci | Mean entropy |
|---|---|---|---|
| 1 | Whelan | 1-50 | 2.44-1.83 |
| 2 | Whelan | 51-100 | 1.83-1.48 |
| 3 | Whelan | 101-150 | 1.47-1.07 |
| 4 | Whelan | 151-200 | 1.07-0.38 |
| 5 | Whelan | 201-210 | 0.38-0.16 |
| 6 | Philippe | 1-50 | 1.99-1.24 |
| 7 | Philippe | 51-100 | 1.22-0.75 |
| 8 | Philippe | 101-151 | 0.73-0.00 |
| 9 | Philippe | 50 lowest entropy that are not RP | 0.99-0.15 |

RP = ribosomal proteins

In the Whelan et al. (2015) dataset, mean entropy ranged between 0.16 and 2.44 and was not biased by missing data (quantified by gap scores and sequence lengths; Figure 1).


Figure 1. Locus statistics distributions for loci in the Whelan et al. 2015 dataset. From top to bottom, each box plot represents the distribution of entropy, gap score, conservation score and sequence length in one locus. Since the dataset is of protein sequences there is no %GC information. Entropy does not seem correlated with other statistics with the exception of conservation scores. Median entropy ranges between 0.16 and 2.44.

Throughout the tree figures, ReproPhylo has automatically coloured the Ctenophora pink, sponges purple, Cnidaria yellow, Placozoa green, and bilaterians and outgroups white, following the colour scheme of Whelan et al. All five loci subsets in the Whelan et al. (2015) dataset recovered Ctenophora as the sister clade of all other Metazoa. The tree constructed from the fastest evolving loci is in Figure 2.


Figure 2. A phylogenetic tree reconstructed from Whelan et al. 2015 loci with median entropy of 1.83 – 2.44.

Ctenophora were also recovered as the earliest branching metazoan lineage in the Philippe et al. (2009) dataset when the 50 fastest evolving loci were used. They were not recovered as such with the 50 slowest evolving ones, half of which are ribosomal proteins (Figure 3).


Figure 3. A phylogenetic tree reconstructed from Philippe et al. 2009 loci with median entropy of 0 – 0.73.

When ribosomal proteins are excluded, the 50 slowest evolving loci do yield Ctenophora as the earliest branching metazoan phylum (Figure 4), although the entropy range is different. The slowest evolving ribosomal protein has a median entropy value of 0, and the slowest evolving non-ribosomal protein has a median entropy value of 0.15, which is comparable with the slowest evolving gene in the Whelan et al. (2015) dataset (0.15 and 0.16 respectively).


Figure 4. A phylogenetic tree reconstructed from Philippe et al. 2009 loci with median entropy of 0.15 – 0.99, no ribosomal proteins.


As Whelan et al. (2015) claim, ribosomal proteins conflict with the otherwise very robust signal placing ctenophores at the base of the metazoan tree. It could be argued that, since on top of being structurally constrained (aren’t all proteins?) they are also the slowest evolving, ribosomal proteins are more reliable rather than a source of bias. However, it is hard to ignore the otherwise impressive consistency of Ctenophora’s position in the tree when the ribosomal proteins are excluded.

But my main point here really is to show how efficient and effective a reanalysis can be if the data is accessible, and with a tool like ReproPhylo at our fingertips. Truly, I spent far more time writing this post than I did running the analysis, which, mind you, without any effort on my part, produced the Git repositories (available as zipped folders in figshare) and a human readable output containing an html report, scripts, data files and figures (also in figshare).

ReproPhylo webpage
ReproPhylo github
ReproPhylo manual

@ReproPhylo twitter


Capella-Gutiérrez, Salvador, José M. Silla-Martínez, and Toni Gabaldón. “trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.” Bioinformatics 25, no. 15 (2009): 1972-1973.

Katoh, Kazutaka, and Daron M. Standley. “MAFFT multiple sequence alignment software version 7: improvements in performance and usability.” Molecular biology and evolution 30, no. 4 (2013): 772-780.

Kück, Patrick, and Karen Meusemann. “FASconCAT: convenient handling of data matrices.” Molecular Phylogenetics and Evolution 56, no. 3 (2010): 1115-1118.

Moroz, Leonid L. “Convergent evolution of neural systems in ctenophores.” The Journal of experimental biology 218, no. 4 (2015): 598-611.

Philippe, Hervé, Romain Derelle, Philippe Lopez, Kerstin Pick, Carole Borchiellini, Nicole Boury-Esnault, Jean Vacelet et al. “Phylogenomics revives traditional views on deep animal relationships.” Current Biology 19, no. 8 (2009): 706-712.

Ronquist, Fredrik, and John P. Huelsenbeck. “MrBayes 3: Bayesian phylogenetic inference under mixed models.” Bioinformatics 19, no. 12 (2003): 1572-1574.

Stamatakis, Alexandros. “RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.” Bioinformatics 30, no. 9 (2014): 1312-1313.

Shannon, Claude Elwood. “A mathematical theory of communication.” ACM SIGMOBILE Mobile Computing and Communications Review 5, no. 1 (2001): 3-55.

Whelan, Nathan V., Kevin M. Kocot, Leonid L. Moroz, and Kenneth M. Halanych. “Error, signal, and the placement of Ctenophora sister to all other animals.” Proceedings of the National Academy of Sciences 112, no. 18 (2015): 5773-5778.

DOI of this blog post:

Szitenberg, Amir (2015): Reanalyses done for the blog post “Modest reproducibility success”. figshare.


Retrieved 16:25, Jun 22, 2015 (GMT)

Jun 20, 2015

This is a guest post by Amir Szitenberg, postdoctoral researcher in my group at the University of Hull, and main author of @ReproPhylo

I find the ReproPhylo approach to experimental phylogenomics very exciting, and can see how it would lead to a better, in depth understanding of phylogenomic datasets, regardless of their size. An example of this is described in my and Dave’s preprint, written together with Max John and Mark Blaxter. As for the implementation of reproducibility tools in ReproPhylo, they are meant to be completely the opposite: quiet and unexciting. Something that allows us to focus on the good stuff. However, since this is my first ever blog post on ReproPhylo, I will focus here on the reproducibility aspects of the programme. The ReproPhylo environment has reproducibility features that are built into the python module, others that are the benefit of ReproPhylo’s Git integration, and an extra layer of reproducibility that is gained by distributing ReproPhylo as a Docker image.

ReproPhylo on its own (well, with its python module dependencies)…

Any pipeline that uses scripts is a step forward toward reproducibility. However, in addition to having a concise scripting syntax, ReproPhylo achieves several additional goals.

Explicit provenance

Provenance (information about the data inheritance chain that produced a certain result) is a tough one. It is very easy to produce a very nice tree, but then have doubts regarding which version of the sequence alignment produced it. ReproPhylo circumvents the problem. A single item, the Project class object, contains all the datasets, including inputs, intermediate and outputs, with IDs associating them with the process that produced them. If you have this object in hand, you have provenance. You do not have to match the tree file to the alignment, or to the sequence version used in it. Reporting methods will make the relationships among the Project components clear to the user.
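
To make the idea concrete, here is a toy container in the same spirit. It is emphatically not the ReproPhylo Project class or its API; `ToyProject`, its methods and the step labels are invented for illustration. Each dataset is stored under an ID together with the step and inputs that produced it, so the lineage of any result can be walked back.

```python
from dataclasses import dataclass, field

@dataclass
class ToyProject:
    """Holds every dataset plus a record of which step and inputs produced it."""
    data: dict = field(default_factory=dict)
    produced_by: dict = field(default_factory=dict)

    def add(self, data_id, value, step, inputs=()):
        """Store a dataset and remember the step and input IDs behind it."""
        self.data[data_id] = value
        self.produced_by[data_id] = (step, tuple(inputs))

    def lineage(self, data_id):
        """Walk back from a result to the raw inputs it was derived from."""
        step, inputs = self.produced_by[data_id]
        chain = [(data_id, step)]
        for parent in inputs:
            chain.extend(self.lineage(parent))
        return chain
```

Because the tree, the alignment and the sequences live in one object, asking where a tree came from is a lookup rather than detective work across files.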

Persistence, repeatability, and extendibility

The Project object is automatically and continuously pickled. ReproPhylo uses the cloud python module, with its ‘pickling’ functionality, to save the Project object as a binary file, and updates the pickle file whenever an action is taken (eg, sequences are aligned). This file secures the persistence of the analysis, as it can always be read again to review or continue the analysis, or to access the data in it. ReproPhylo utilizes commonly used phylogenetic analysis programmes (see manual) that can be easily and flexibly controlled with built in functions. However, since data are always maintained as standard Biopython and ETE classes (SeqRecord, MultipleSeqAlignment and Tree), they can be accessed without ReproPhylo and plugged directly into pipelines that use these modules, so data can be passed to or read from any programme that is not yet integrated. The original data objects, nested within the Project, can be tweaked or used in place; alternatively, there are Project methods that produce copies to work with, keeping the original as is. Interfacing without depending on Biopython and ETE is possible through Project methods that produce or read text files in any of the formats compatible with Biopython and ETE.
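
The save-after-every-action pattern can be sketched with nothing but the standard library. Note the assumptions: this uses the stdlib `pickle` module rather than the cloud module ReproPhylo relies on, a plain dictionary stands in for the Project object, and all three function names are hypothetical.

```python
import pickle

def save_project(project, path):
    """Write the whole project object to a binary pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(project, handle)

def load_project(path):
    """Read a project back to review or continue the analysis."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

def run_step(project, name, result, path):
    """Record the result of one analysis step, then re-pickle immediately."""
    project[name] = result
    save_project(project, path)
    return project
```

Since the file is rewritten after each step, the most recent state is always on disk, and `load_project` picks up where the last session stopped.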

Figures and tables

ReproPhylo stores the metadata of the sequences, as read from GenBank files or CSVs. As the metadata appears there, so it will on your trees. There is reduced opportunity for human error in the transference of metadata from data bins to trees or built in analyses and steps (e.g. trait matrix for Bayestraits), as long as it was fed in correctly initially. Species names will be the same as in the original GenBank record, or as set in the metadata spreadsheet, no matter if you decide to add them to your figure post hoc. Need to change a species name or some other metadata? Change your spreadsheet, the change will propagate to your figures.
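
The principle is that labels are looked up, never retyped. A minimal sketch, assuming a CSV with a `record_id` column and an `organism` column (both column names, and both function names, are placeholders rather than ReproPhylo’s own):

```python
import csv
import io

def read_metadata(csv_text, id_column="record_id"):
    """Map each sequence ID to its row of metadata from a CSV spreadsheet."""
    return {row[id_column]: row for row in csv.DictReader(io.StringIO(csv_text))}

def leaf_labels(leaf_ids, metadata, qualifier="organism"):
    """Build tree leaf labels from the metadata instead of typing them by hand."""
    return {leaf: metadata[leaf][qualifier] for leaf in leaf_ids}
```

Edit the spreadsheet, rerun, and every figure that draws its labels this way picks up the change.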

Reproducible publications

ReproPhylo, upon request will produce an archive file containing everything needed for publishing the analysis. It will contain the sequences and metadata as a GenBank file, the trees and alignments as a PhyloXML file, the figures, and a report providing detailed descriptions of the methods and data composition.

With Git integration there’s also…

Provenance integrity – the final nail

Git works quietly in the background. ReproPhylo will record a version of the pickle file any time it is updated, as well as of input files, scripts and Jupyter notebook files. Built in Project methods allow you to view the Git commits and to revert the Project to versions other than the current one. Toggling between, say, tree versions in this way cannot damage provenance information, because the whole Project containing the tree will be toggled along with it, preserving provenance.

Simplified publication of complex pipelines

A consequence of using Git to allow toggling between Project versions is the creation of a Git repository (a .git directory). ReproPhylo makes sure not to interfere with pre-existing Git repositories or with ones that do not explicitly belong to the current Project. This Git repository is all you need to publish your workflow, as it can be pushed to GitHub and then given a FigShare DOI using FigShare’s Git integration. This way of publishing your analysis is more direct and cuts out the middleman, that is, the supplementary files.

Working with Docker (or other virtualization solutions)

Unlike Git, Docker is not integrated in ReproPhylo, just used as a method of distribution. However, the combination of the ReproPhylo Docker image and the Git repository produced by your analysis is as close as one can get to ultimate reproducibility, as far as I can see. The Docker container is the environment in which the analysis was done, and all the challenges of recreating this environment do not exist anymore. This combination of a Git repository and a Docker container eliminates the hours otherwise spent setting up a high quality reproducible publication, and the hours a reader interested in repeating and/or extending the analysis would spend on installation and configuration.

Since ETE and matplotlib communicate with the host OS X11-server to produce graphics, some steps are required to couple the X11-server in the Docker container with that of the host OS. This may seem to stand in the way of a slick installation process, but the solution is simple in Linux OSs, and comes in the form of a shell script that manages all the steps, starting with pulling the image and ending with serving the Jupyter notebook to the local default web browser.

However, Docker is not the ultimate solution on all platforms. In OSX and Windows, there is no simple solution, since Docker operates as a virtual machine in these operating systems. For Windows, I solved this by forfeiting containerization. Instead, ReproPhylo is distributed as a WinPython self-contained version. Just download, extract, and fire up the Jupyter notebook. In OSX, I lean towards the other direction, replacing the Docker image with a full scale Ubuntu VM image (for full installation details see the manual). I would have loved to have a single distribution which is truly cross-platform and seamlessly installs on any machine, but this doesn’t seem likely to happen in the near future.


The tools for easy and reliable reproducibility exist. It is putting them together and configuring them for our needs, as ReproPhylo attempts to do, which might take some time. However, the time put into it is undoubtedly regained when these tools are routinely taken advantage of.

ReproPhylo webpage | ReproPhylo github | ReproPhylo manual

@ReproPhylo twitter

May 27 2015

Our new phylogenomics environment is called ReproPhylo. It makes experimental reproducibility frictionless, occurring quietly in the background while you work on the science. The environment has a lot of tools to allow exploration of phylogenomics data and to create phylogenomic analysis pipelines. It is distributed in a Docker container simplifying installation and allowing the reproducibility of the experimental computer environment in addition to files. I’ve outlined the background to this in previous ReproPhylo posts.

ReproPhylo is not a phylogenomics pipeline, it is a reproducibility environment

Well, OK, it is a phylogenomics pipeline, and I think a very good one, but that was not the primary objective in creating it. We did not set out to make the world’s most sophisticated pipeline for phylogenomics because we felt that was meaningless without reproducibility. There are a number of phylogenomics pipelines out there, and I look forward to learning more from them about how they approach phylogenomics. For me, however, the starting point has to be reproducibility. We started with the question ‘what would fully reproducible phylogenomics look like?’ and we got to here.

What is reproducibility? and why replication isn’t enough

What is reproducibility? This is important because it helps us to think about workflows, and what we need to achieve them. Firstly, we want to be able to take an experiment we (or someone else) have previously run and re-run it, getting exactly the same outputs. This is replication. Secondly, we want to extend or modify the experiment, adding our new sequences perhaps, or using a different tree building algorithm or parameters. This repeat with modification is reproduction. Reproducibility is also used to describe the field, a general term for all types of ‘this sort of stuff’, which is how I usually use it. Replication isn’t enough. Yes, we need to be able to repeat the experiment to check that it really works, but then what? Science is about extending others’ work, building on their discoveries, standing on the shoulders of giants to see farther. ReproPhylo has not been designed to just freeze the experiment for replication.

ReproPhylo promotes reproducibility not just replication

An important component of reproducibility is the ability to repeat the experiment with modification. This is the normal scientific process in phylogenetics and phylogenomics, and elsewhere. ReproPhylo promotes this approach, not just by providing the infrastructure to ensure that the experiment can be repeated, but also by providing the user with extensive help in data exploration, to identify loci and parameters for further investigation.

[Figure: phylogenetic focus]

ReproPhylo helps you to carry out exploratory data analysis (EDA) by providing summary statistics and plots of the raw data and alignments. We have selected a range of these to display, and write an HTML report characterising the data and results for every analysis without you needing to do anything. Ignore the report, casually browse it, or use it as the basis of your next experiment as you wish. The nature of the stored data, and the powerful statistics and plotting modules available in this Python environment, mean that it is easy to customise your EDA.

The figure to the left is my version of that by Leek and Peng 2015 discussing p-values. I could have added many more grey boxes. The point is that there is much more to phylogenomics than just how big your support values are, and EDA can help.

EDA is assisted by ReproPhylo presenting many useful plots and summaries. However, if you want something else you only have to calculate it once; from then on it’s available to you, either displayed by default, or when you request it in the pipeline.


Data filtering

When you have a lot of data, a few poor-quality loci may not matter much and might not affect the outcome. But how can you tell? Ideally you should be able to identify loci that are atypical or do not meet some criteria that you specify. Then it should be easy to branch the workflow to exclude those loci in a separate iteration, and then to test the effect of the change. We do this in the ReproPhylo paper. Does EDA solve all your problems? Probably not, but it gets you to ask questions. In the boxplot, do the small outlier loci for 18S make any difference to the analysis? Exclude them and repeat the analysis to test; it’s easy if the analysis is reproducible and you just need to rerun the script.
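As a rough illustration of the "exclude and repeat" step, here is a minimal Python sketch that flags unusually short sequences using the usual boxplot rule (below Q1 minus 1.5 times the interquartile range). It is not how ReproPhylo itself filters data; the function name, the threshold and the sequence lengths are assumptions invented for the example.

```python
import statistics

def short_outliers(lengths, k=1.5):
    """Return ids of sequences shorter than Q1 - k * IQR (the lower boxplot fence)."""
    q1, _median, q3 = statistics.quantiles(lengths.values(), n=4)
    lower_fence = q1 - k * (q3 - q1)
    return sorted(name for name, n in lengths.items() if n < lower_fence)

# Hypothetical 18S sequence lengths (in bases) for six samples.
lengths_18s = {"sp1": 1800, "sp2": 1790, "sp3": 1810,
               "sp4": 1805, "sp5": 1795, "sp6": 600}

outliers = short_outliers(lengths_18s)
kept = {name: n for name, n in lengths_18s.items() if name not in outliers}

print("excluded:", outliers)
print("kept for the repeat analysis:", sorted(kept))
```

The same script can then be rerun with `kept` in place of the full data set, and the two trees compared.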


What is the point of being completely reproducible if it’s just too much work for anyone to actually reproduce your work? I’m sure we’ve all come across situations where we just give up because the goal doesn’t justify the work to get there. I recently spent 2 days of my life trying to open somebody else’s supplementary Nexus format file! Your work should be reusable. This means that it should be easy for someone else to pick up your experiment and extend it, and although reusability is often overlooked it is vitally important. Some of the largest challenges with ReproPhylo were with reusability.

ReproPhylo environments are easy to install

We decided that to be fully reproducible we should try to reproduce the computer environment rather than just the script. This means that when you download somebody else’s workflow you will have the same versions of the software, the same dependencies, and the same ReproPhylo script. It should run identically. This also means that everything is installed in one go.

We have used Docker, the popular container software, to do this. I was surprised how easy it was to install Docker ReproPhylo containers. Really it is just one command line to run a shell script. It takes a few minutes the first time (in order to download and install Docker) and is pretty fast thereafter. From that point on you open up the Jupyter (aka IPython) notebook and have everything in place.

Jupyter notebooks make a great interface

I originally thought that ReproPhylo would run primarily in the Galaxy web interface, and we do have it running in this environment. I have been really pleased however with how it looks in a Jupyter notebook. I really like this GUI which is a mixture of explanatory documentation and code snippets (sometimes called literate programming). Our naive guinea pig testers (thanks Dan, Claudia, Max et al.) seemed to take to it well. I wonder if we will maintain the Galaxy interface going forward?

Running ReproPhylo

Is ReproPhylo as easy to use as a binary GUI application like your favourite phylo program? Almost, but not quite, is my honest answer. It is not intended to be, that’s not where we put our effort. It is almost as good though. You can be up and running quickly with one line in the terminal, and the instructions for this are written for beginners. Don’t think the documentation is clear enough, or quite right for your OS? Change it! The docs are editable by anyone. Once it’s running, open up a notebook that is close to what you want to do from the phylogenetic library, e.g. “Single locus align and ML tree”, or “concatenated alignment with partitions and ML” (this library is a work in progress, though we already have some written on GitHub). Once you have the notebook open, the code snippets to run the analysis are surrounded by documentation explaining them and highlighting the bits to change for your experiment: “change this to specify your sequence data”. You should probably just run it to check your data is OK, then tweak the defaults to taste, e.g. swap the aligner, or post alignment trimming or tree parameters. This is much easier than starting with nothing and having to read the damn manual before doing something!

The right balance between naive users, and experienced bioinformaticians and phylogeneticists is challenging. We think that literate programming in Jupyter notebooks is the right approach. Don’t need all the clutter? Delete it and just code your pipeline using ReproPhylo. Lost? Read, ask for help, make explanatory notes for yourself in the notebook, run and repeat. Next time you start with your very detailed notebook and go from there.

What parts of the experiment should be archived and how?

All of it, and in several ways. ReproPhylo achieves reproducibility by a mixture of git, pickle, and saving standard format data files into a Docker container. Amir Szitenberg is going to do a guest post describing some of the underlying workings. At the end of your run, you will have a frozen (pickled) version of the experiment process that can be easily re-awakened (unpickled) to carry on where you left off. You will have a git repository that has quietly tracked all changes in the background. You will have key outputs (data, alignments, trees) saved in standard text formats (e.g. NeXML) that can be imported to other programs outside of ReproPhylo if necessary, and a .zip archive of this. The first thing you should do is probably just upload the .zip archive to FigShare, and get a doi. Include the doi in your manuscript, job done, you are now much more reproducible than almost everyone else. To be completely reproducible, and make re-use more likely, you should also archive the Docker container that you have been working in, so storing not just all the data but also scripts and the whole computer environment. It occurs to me writing this that it is idle speculation. The only way to really work out how your experiment should be archived is to have people try, and fail. Archive everything, in several ways, and feed back when things don’t work well.
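For readers who have not met pickle before, freezing and re-awakening looks roughly like this in plain Python. This is a minimal sketch using only the standard library, not ReproPhylo's own Project class; the dictionary keys and values are hypothetical stand-ins for an analysis in progress.

```python
import pickle
import tempfile
from pathlib import Path

# A stand-in for an analysis in progress; keys and values are hypothetical.
project = {"loci": ["18S", "28S"], "aligner": "some-aligner", "trees": {}}

with tempfile.TemporaryDirectory() as tmp:
    frozen = Path(tmp) / "project.pkl"
    frozen.write_bytes(pickle.dumps(project))      # freeze (pickle)
    restored = pickle.loads(frozen.read_bytes())   # re-awaken (unpickle)

# Carry on where you left off; the restored object is an independent copy.
restored["trees"]["18S"] = "(A,(B,C));"
print(restored)
```

A pickle is only as portable as the Python environment that wrote it, which is one more reason to archive the standard text formats and the container alongside it.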

Computer backups and direct debits

If reproducibility is something you intend to do tomorrow, when you have just finished this analysis, then it has failed. All the evidence suggests that post-hoc reproducibility will be infrequent and unreliable. Computer backups are a good analogy for reproducibility. It is clear that the only strategies that work for backing up your hard drive are ones that do it automatically without requiring your input. All the technical sites agree on this, and however much you really do intend to backup every Friday before you go home, it won’t happen like that. Guilt? That just makes you guilty but no more responsible. Reminders? That works for a short time but then you become desensitised. Automated backup scripts, running in the background, not bothering you but just saving your work to a different drive, yes that works. Businesses know the value of automated action without user input, that’s why they are keen on direct debit payments from your bank account, and give rewards for setting them up. They know even good people with the best of intentions are unreliable.

So, can we implement reproducibility quietly, frictionlessly, in the background, without annoying people? Frictionless reproducibility is something I am very keen on. I talk about it a bit in the YouTube video as part of Nick Loman’s Balti&Bioinformatics series. ReproPhylo will take care of reproducibility without you having to do anything except choose to use the ReproPhylo pipeline. You do not need to save your work, or make notes of what file was used, or the settings for that analysis. Just do it, ReproPhylo will record and archive everything, frictionlessly in the background.

If you decide to develop your own reproducibility approach (great!) I very highly recommend that you make it frictionless and don’t put the burden on the user to behave well. I’m really into reproducibility, but I don’t always behave well. We are all just too busy to do this, I want my electricity bill paid automatically, my hard drive backed up regularly and silently, and my phylogenomics reproducibility taken care of for me, without any friction. That way I can concentrate on science.

Here’s something I wasn’t expecting…

ReproPhylo makes your work faster

ReproPhylo is really fast. Not by improving algorithms but by making everything else automated. “Everything else” is the biggest time suck. Example workflows are provided, modest at present, but you can save a library of your own previous analyses, edit the one closest to what you now wish to do and run. You will not have to set up the pipeline from scratch each time. You can also take good ideas from other people’s workflows.

EDA is really important. Most people don’t do it because it’s time consuming. How would you plot a histogram of sequence lengths to check for short ones? How would you identify sequences with a lot of ambiguities? How would you determine if GC content was homogeneous across taxa, or loci? Does your method to do these things scale? ReproPhylo presents you with a lot of information, and it’s not hard to add other things yourself. Want a hexbin plot with marginal distributions of ambiguities vs sequence length? Google it, take a code snippet, and paste it into the notebook. You can’t extend the functionality of standard applications like that!
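To give a feel for how little code these checks need, here is a minimal sketch in plain Python that computes sequence length, GC content and the fraction of ambiguous bases per sequence, plus a binned count of lengths. It is not taken from ReproPhylo; the function names and the toy FASTA records are assumptions for illustration, and it expects unaligned sequences (gap characters would be counted as ambiguities).

```python
from collections import Counter

def read_fasta(text):
    """Parse FASTA text into {name: sequence}, upper-casing the bases."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif line and name is not None:
            records[name] += line.upper()
    return records

def gc_content(seq):
    """Fraction of G and C among unambiguous bases (0.0 if there are none)."""
    called = [base for base in seq if base in "ACGT"]
    return sum(base in "GC" for base in called) / len(called) if called else 0.0

def ambiguity_fraction(seq):
    """Fraction of characters that are not A, C, G or T."""
    return sum(base not in "ACGT" for base in seq) / len(seq) if seq else 0.0

def length_histogram(records, bin_width):
    """Count sequences per length bin, keyed by the lower edge of each bin."""
    return Counter((len(seq) // bin_width) * bin_width for seq in records.values())

# Toy data; real input would be read from your own sequence file.
FASTA = ">s1\nACGTACGT\n>s2\nGGCCNNNN\n>s3\nAT\n"
records = read_fasta(FASTA)
summary = {name: (len(seq), gc_content(seq), ambiguity_fraction(seq))
           for name, seq in records.items()}
histogram = length_histogram(records, bin_width=5)

for name, (length, gc, ambiguous) in summary.items():
    print(name, length, gc, ambiguous)
```

Because the functions work on a whole dictionary of records, the same few lines apply to three sequences or three hundred thousand.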

The future

ReproPhylo is a work in progress; it works and works well, but it does not do everything. It is being developed by Amir Szitenberg, and since we are using it heavily for phylogenomic analyses it is likely to improve steadily. We would love others to use it, and are happy to help. It’s all on GitHub as CC0, with public editable documentation. Even if you don’t like our approaches (and we hope you will) you should think very seriously about building your pipeline into a reproducibility environment. You are welcome to make use of ours, you don’t even have to ask. Alternatively contribute to ReproPhylo and make it incorporate your needs.

On our list are many small things, additions that are needed, but the big things are more interesting. ReproPhylo does not yet handle large parallel jobs well and that is going to be needed. Making phylogenomic analysis software scale to increasing data is a major challenge for everyone. Also a challenge is scaling the users’ ability to understand the nature of the big data they are feeding the script for analysis. This matters, or else ‘rubbish in, rubbish out’ will become a common phrase.

@ReproPhylo is the twitter handle to follow progress.

Webpage and code http://hulluni-bioinformatics.github.io/ReproPhylo/

Documentation is here http://goo.gl/yW6J1J

Preprint http://dx.doi.org/10.1101/019349

Nov 23 2014

I’ve recently come across the idea of stars for open data quality thanks to Steve Moss. The table below is from 5stardata:

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context


How does this relate to phylogenetic data? Here is my suggestion for a star system for phylogenetic data:
Anyone want to suggest changes to this star system?

★ publish a picture of your tree in a journal article
★★ make seq alignment, tree & metadata available in suppl data with the paper
★★★ as 2star but save as XML e.g. NeXML, PhyloXML in supplementary data with the paper
★★★★ as 3star but place open access NeXML file on FigShare or Dryad with URIs
★★★★★ as 4star and link your data to other data to provide context


1 star: Surely we are past the point where people do not archive their newick tree file? Or am I being too optimistic?

2 star: This seems to be the current standard. Metadata often means a Word doc table or Excel spreadsheet. Unfortunately these complex and fragile proprietary file formats create a barrier to machine reading the data. A simple csv file would be much better and you could easily open it in Excel if you really insist. Surely open access publication is a prerequisite for 2 stars?

3 star: Want to increase your star rating? This would be an easy step to take for most people. Many good programs are supporting new rich standard formats like NeXML and PhyloXML and we should hassle the authors of software not doing so.

4 star: Again this is an easy win. Make sure your data is open access, machine findable and machine readable. Figshare is ridiculously powerful and easy to work with. Your files (all of them) can be bulk uploaded. You will get a repository doi link to quote in your manuscript and share with people. Individual files have doi links too.

5 star: This is more difficult to do well. Some of this may have been achieved by use of XML files, but how much? I have a lot to learn here about linked data. Having files that use the NCBI taxonIDs and official gene names allows automatic link-outs to be created. XML files can do exactly this. But how well does this work? The potential of linked data is also bigger than this, I have more reading to do. I like Tim Berners-Lee’s bag of chips (crisps!) analogy.


The idea of data stars originated with Tim Berners-Lee, and there are nice descriptions of the system on 5stardata.info and a YouTube “bag of chips” video of TBL explaining many of the ideas.

Edit: I liked the idea of starting the list at zero, no stars, because (A) that’s how computer languages count and (B) you don’t deserve any stars at all for just putting a picture of your data in a publication. But it seemed too petty.


  1. Cranston K, Harmon LJ, O’Leary MA, Lisle C: Best practices for data sharing in phylogenetic research. PLoS Curr 2014, 6.
  2. Cranston K, Blackburn D, Brown J, Dececchi A, Gardner N, Greshake B, Harmon L, Holder M, Holroyd P, Irmis R, Jansma R, Lloyd G, Mabee P, Miller M, Mounce R, Mungall C, O’Leary M, Pardo J, Parr C, Piel WH, Stoltzfus A, Turner W, Vision T, Wright A, Watanabe A, Wolfe J: Simple rules for sharing phylogenetic data. figshare 2014.
  3. Sharing data with Open Tree of Life [http://blog.opentreeoflife.org/data-sharing/]
  4. Stoltzfus A, O’Meara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, Rosauer DF, Vos RA: Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis. BMC Res Notes 2012, 5:574.
  5. Han MV, Zmasek CM: phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 2009, 10:356.
  6. Vos RA, Balhoff JP, Caravas JA, Holder MT, Lapp H, Maddison WP, Midford PE, Priyam A, Sukumaran J, Xia X, Stoltzfus A: NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol 2012, 61:675–689.

Nov 21 2014



Phylogenetic experiments need explicitly designed reproducibility, rather than accidental or partial reproducibility. There are many working reproducibility solutions out there differing in their approach, interface and functions. There is no perfect solution for all cases, and you can learn a lot by investigating.

Here I discuss a few software approaches to reproducible phylogenetics. I’m not sure that it is possible for me to review each of these here. Instead consider it a list of bookmarks for you to see the much better examples and descriptions at their home websites. I’ll make some brief comments about their suitability for reproducible phylogenetics though please remember that your experience may differ from mine, and I have only relatively briefly investigated each.


“Sumatra is a tool for managing and tracking projects based on numerical simulation or analysis, with the aim of supporting reproducible research. It can be thought of as an ‘automated electronic lab notebook’ for simulation/analysis projects”

The extensive Sumatra documentation refers mostly to the command line usage but I really like the GUI that runs out of a web browser tab. Data provenance is managed for you, with the ID of information recorded. Version control is via git or similar though not really incorporated into Sumatra itself, instead Sumatra nags when trying to use data not present or changed in the git repository. I wish this was a little more frictionless, but it might be my lack of experience. A typical design would be to run a script with a parameter file and I’m not sure of its flexibility with more complex arrangements.

make files

Make files are popular in some circles (Broman: “Minimal make”) and can certainly help achieve reproducibility. I can’t help feeling though that this is a direction only for the computationally very experienced. It doesn’t seem likely that this is the way forward if the goal is to increase community reproducibility. What is the best approach is irrelevant unless it is used, and make is unlikely to be broadly used.


phyloGenerator (Pearse & Purvis 2013) isn’t primarily reproducibility software, but is a phylogenetics pipeline that describes itself as:

“an easy way for ecologists to make realistic, tenable phylogenies. One-click install and fully-customisable”

I include it here because it’s a great piece of work by Will Pearse and is a wonderful example of a phylogenetic pipeline. Even if many aspects of explicit reproducibility are missing it is considerably closer to reproducible phylogenetics than standard approaches.


ETE-NPR is a really smart and powerful package from Jaime Huerta Cepas and the ETE team, describing itself as:

“providing a complete environment for the design and execution of phylogenomic workflows”

It uses the Nested Phylogenetic Reconstruction approach of Huerta-Cepas et al, 2014. One thing I particularly like is the portability of the software as the package makes use of Vagrant to allow standardisation of the environment. Like phyloGenerator, ETE-NPR isn’t reproducibility software, but the workflow design means that it is closer than most approaches and could become so.


Taverna (Wolstencroft et al. 2013) is one of the leaders in reproducible workflows, and there is great documentation and videos available. I do not wish in my own work however to use web resources for analysis, they are too unreliable. If you feel differently then Taverna is perhaps for you. One of the things I love most about Taverna is the easy sharing of workflows via the MyExperiment website. Need to build a workflow? Someone will have almost done it before and posted it for all to use, now you only need to tweak a workflow.

Knitr & Sweave

The R community has led the way in creating reproducible experiments, and sweave and knitr are often involved. Pweave is a python version of sweave. These generally create documentation of the experiment by using the ability to mix markdown and code in a single system. Since the actual code is found in the report, the experiment can be reproduced if data is also archived. These seem like mature systems, but I don’t use R much, and I can’t quite see the advantage over IPython notebooks (below).

IPython notebooks

I am very impressed by IPython notebooks for user-friendly reproducible science. The notebooks are a mixture of cells, which can be code cells or documentation cells. Code is not just Python, but also 15 other languages including R, Julia, or even system shell commands. Code runs in an interactive way, with outputs appearing in the notebook instantly under the script that generated them. There are even interactive widgets to modify parameters and see their effect on the data in real time, via sliders. It has a large and active user community, is under active development, and is widely used by scientists of all disciplines. For me the best thing however is that it is the most user friendly of all scripting solutions. The documentation cells mean that you can create something resembling a basic GUI in many ways: “Change example.fas below to your real sequence file name, then press run to get a histogram of the sequences’ GC content”.


I have really noticed how quickly students and postdocs take to scripting in IPython notebooks. Its interactive nature is a very intuitive way to write and modify code.

Script-based phylogenetics means that all parameters and the instructions for data transformations must be reproducibly recorded. Integrating version control like git can solve the recording of all data file changes, and even explicit provenance is possible.

The future of IPython Notebooks probably lies in Project Jupyter which evolved from IPython, is language agnostic, and is being developed with the Julia community and Google. Google have been developing Colaboratory which I am really excited about. It uses Jupyter and Google drive with all the sharing, commenting, and collaboration options that gives.

This Video introduces Jupyter and then Colaboratory (about 9 mins for both talks).


Galaxy (Goecks et al. 2010) is probably the most powerful and usable solution for reproducibility with a large and vibrant developer community. It is primarily used for genomics but one of the most interesting solutions so far is Osiris (Oakley et al. 2014) from Todd Oakley’s lab. This leverages the built-in reproducibility aspects of Galaxy to achieve reproducible phylogenetics. Provenance, data transformation, software environment, and open archiving are all achieved. This is great work. Galaxy certainly can deliver reproducible phylogenetics, but it is not clear that the work will be easy to reproduce outside of Galaxy. Should we be concerned about that? I still feel that there is work to be done though, and room for alternatives to this approach. The fact that Osiris (and ReproPhylo, our solution) are open source allows much more rapid progress as we discover what works best.


Next I’ll introduce ReproPhylo. Amir Szitenberg is a postdoc with Mark Blaxter and me and deserves almost all the credit for delivering this reproducible phylogenetic environment. It works with both Galaxy and IPython notebooks. We’ve done some alpha testing, are working on the documentation, and Amir is adding a few last features. Almost finished, almost.


Broman K. Minimal make: a minimal tutorial on make [http://kbroman.org/minimal_make/]

Goecks J, Nekrutenko A, Taylor J, Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010, 11:R86.

Huerta-Cepas J, Marcet-Houben M, Gabaldón T: A Nested Phylogenetic Reconstruction Approach Provides Scalable Resolution in the Eukaryotic Tree Of Life. PeerJ PrePrints; 2014.

Oakley T, Alexandrou M, Ngo R, Pankey M, Churchill C, Chen W, Lopker K: Osiris: accessible and reproducible phylogenetic and phylogenomic analyses within the Galaxy workflow management system. BMC Bioinformatics 2014, 15:230.

Pearse WD, Purvis A: phyloGenerator: an automated phylogeny generation tool for ecologists. Methods Ecol Evol 2013, 4:692–698.

Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P, Bhagat J, Belhajjame K, Bacall F, Hardisty A, Nieva de la Hidalga A, Balcazar Vargas MP, Sufi S, Goble C: The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 2013, 41(Web Server issue):W557–61.

Nov 10 2014

Previously I wrote about (1) why we need reproducibility in phylogenetics, (2) what we need to achieve it. This is part 2b, still writing about what we need to achieve reproducibility. My conclusion before was:

“that most of the issues surrounding reproducible phylogenetics are solved problems in other disciplines. The things that are still challenging are not about achieving reproducibility but about achieving it easily, irrespective of computational experience, such that reproducibility becomes the default behaviour.”

Phylogenetic reproducibility is an emerging discipline, and in addition to technical issues in achieving reproducibility (discussed in the previous post) there are other challenges:

Avoiding post hoc claims of reproducibility

Credit is sometimes taken for partial or accidental reproducibility, without a clear reproducible design or test of reproducibility. “I have provided [some key data] that many studies do not usually share, therefore my work is reproducible”. But has your reproducibility been planned, considered, and optimised? Where is the test of your intended reproducibility? This post hoc claim isn’t bad, it is better than average, but we should not allow this unplanned, unevaluated reproducibility to become default behaviour.

Reproducibility and Reusability

We need our science to be reproducible, but also reusable. In addition to the possibility of reproducing the work, we need to make it so that it is easy to reproduce, easy to modify, and easy to reanalyse in different ways. If we are going to do reproducible phylogenetics we should make it useful, i.e. reusable by other scientists, else we have achieved little. Reusability is important.

What do we need to make phylogenetics reusable?

tl;dr We need the ability to reproduce in a reasonable time frame with no more than just a reasonable amount of effort. Standard data formats, original software and dependencies, and analysis instructions, are best wrapped into a connected set of instructions called a “pipeline”.

Pipelines: wrap the workflow in a script

The use of analysis pipelines automatically achieves most of the things we need for reusable and reproducible phylogenetics. The pipeline can call programs in the right sequence, to analyse the correct data files, using the correct parameters, and save outputs in the correct format. All settings for the analysis should be de facto recorded, and an experimental record or log file of the entire analysis can be automatically written. This should make the replication of results a ‘one click’ task, and simple modifications of the original analysis will require only changing a parameter.
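A toy version of such a pipeline can be sketched in Python. It is an illustration of the idea only: the step functions, parameter names and data are hypothetical, and a real pipeline would call alignment and tree-building programs rather than trimming strings. What it shows is that running steps through one wrapper records the settings as a side effect.

```python
import json
import time

def run_pipeline(steps, params, data):
    """Run each step in order and log which step ran with which settings."""
    log = []
    for step in steps:
        settings = params.get(step.__name__, {})
        data = step(data, **settings)
        log.append({"step": step.__name__,
                    "settings": settings,
                    "finished": time.strftime("%Y-%m-%dT%H:%M:%S")})
    return data, log

def drop_short(seqs, min_length=0):
    """Hypothetical step: discard sequences shorter than min_length."""
    return {name: seq for name, seq in seqs.items() if len(seq) >= min_length}

def trim_ends(seqs, n=0):
    """Hypothetical step: remove n bases from both ends of every sequence."""
    return {name: seq[n:len(seq) - n] for name, seq in seqs.items()}

# All settings live in one place; changing the analysis means changing a value here.
params = {"drop_short": {"min_length": 4}, "trim_ends": {"n": 1}}
sequences = {"a": "ACGTAC", "b": "AC"}

result, log = run_pipeline([drop_short, trim_ends], params, sequences)
print(result)
print(json.dumps(log, indent=2))
```

Saving `params` and `log` next to the outputs gives the experimental record described above without anyone having to remember to write it.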

Use standard data formats

I saw someone write that there would be a special place in hell for people inventing new formats for sequence data! Perhaps a bit strong, but I estimate that I have used 1 year of my working life swapping between data file formats. I think this is an accurate estimate, it’s not a joke. Often this has required using 3 separate programs, each offering to read/write in different variants, to get the format into my final application. Things are better now with BioPerl, BioPython etc processing pretty much all sequence file formats, but they still require standards. Standard data formats are a serious matter, do not invent new ones, do not use modifications of standard formats that work in one program only. Instead reject that program and email the author to explain why.

Provide the original software and all its dependencies

It is unfortunate but true that we must include the original software in a reproducible experiment (Morrison 2013; Lunt 2013; Bergman 2012). The reasons why include both reproducibility and reusability. Software, and particularly old versions of software, goes extinct. The author leaves science, or stops having funding to maintain the software, the webpage stops working, and that software is no longer available. Journals do not archive software when they publish the manuscript although we should push for this. Even if software did not go extinct, providing the software aids reproducibility since there is then no question of the software version used. Lastly the distribution of software as part of the experimental package greatly increases reusability since the workflow is bundled together with no need to scour the web for correct versions of the analysis programs used.

Much of this, it seems, is most easily achieved by saving a virtual machine or similar. A VM is not required for reproducibility, but there is only so much time and effort anyone will spend before giving up, so a bundled version of the environment and data is very helpful (I would say essential) to real world reproducibility.

What is best practice in phylogenetic data storage?

The best practice is to retain ALL information in a phylogenetic analysis. There have been a number of articles and posts suggesting the minimal or ideal information for phylogeneticists to record. Forcing the user to choose which data to retain or discard introduces friction and error to the process. Friction is the enemy of science: easy things get done, frustrating things don’t, even if they are important. Ideally reproducibility best practice is something that would just happen without user intervention, and omit nothing.

We could learn a lot from computer backup strategies. Lots has been written on this, and a very powerful message is: automate the backup of all your files. If you have to remember to back up, if you have to make time, if you have to choose, you won’t do it well enough or often enough. Backup of ALL data is something that should happen very regularly in the background as default behaviour in both phylogenetics experiments and life.

What is best practice in phylogenetic data sharing?

This is slightly different from above. Sharing involves easy archiving in open public repositories such that users can access the data and reproduce the analysis, for example as a zip file for easy upload to FigShare, Dryad or similar. In future maybe this could happen frictionlessly, with phylogenetic scripts archiving the data directly using the database’s APIs.
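Building the zip file itself is easy to automate. The sketch below bundles every file in an analysis directory, omitting nothing, and adds a manifest of SHA-256 checksums so a later user can check that the files are intact. The function name `archive_analysis` and the manifest file name `MANIFEST.json` are assumptions made for illustration, and the upload step is deliberately left out because it depends on the repository.

```python
import hashlib
import json
import tempfile
import zipfile
from pathlib import Path


def archive_analysis(analysis_dir, zip_path):
    """Zip every file under analysis_dir and include a checksum manifest.

    Returns the manifest: a dict mapping each relative file path to its
    SHA-256 checksum. The same dict is stored in the zip as MANIFEST.json.
    """
    analysis_dir = Path(analysis_dir)
    zip_path = Path(zip_path)
    manifest = {}
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(analysis_dir.rglob("*")):
            # Skip directories, and the archive itself if it lives inside analysis_dir.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            relative = path.relative_to(analysis_dir).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
            archive.write(path, arcname=relative)
        archive.writestr("MANIFEST.json", json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


if __name__ == "__main__":
    # Demo on a throwaway directory with one small alignment file.
    workdir = Path(tempfile.mkdtemp())
    (workdir / "alignment.fasta").write_text(">a\nACGT\n>b\nACGA\n")
    bundle = Path(tempfile.mkdtemp()) / "analysis_bundle.zip"
    print(archive_analysis(workdir, bundle))
```

Nothing here asks the user to choose which files matter: the whole directory goes in, which keeps the friction low.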

What is best practice in phylogenetic environment sharing?

The analysis may be very reproducible indeed on my computer, but if you are missing a crucial program dependency, or have a different version of Python, or differ in some other part of the environment, it may be impossible to replicate the experiment or even run the scripts. Maybe we need to store and share the computer environment in which the analysis was run. This can be done via virtual machines (VMs), which save the environment (e.g. a whole Linux system) and allow you to run it from within whatever operating system you are working in. Docker is similar to a VM but “lighter”: it doesn’t require you to install the entire operating system, just the parts required. There is a lot of excitement around container systems (like Docker) in computing at the moment, and this is not a technology likely to disappear. Currently I think best practice for reproducibility is to provide a Docker container with ALL scripts, data, and the computational environment needed to run them exactly as they ran locally. This environment is not static of course; the new user can update the software versions as normal, and then compare any changes to the original.
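A container captures the whole environment, but even without one it is cheap to record what the analysis ran on. The following sketch writes out the Python version, the platform, and the installed Python packages using only the standard library (it needs Python 3.8 or later for `importlib.metadata`). The function name `snapshot_environment` is hypothetical, and this is a complement to a container rather than a replacement, since it sees nothing outside Python.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Return a JSON-serialisable description of the current Python environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip any installation with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot; in a pipeline this would be saved next to the results.
    print(json.dumps(snapshot_environment(), indent=2))
```

Saved alongside the results, a snapshot like this lets a later user see at a glance where their environment differs from the original.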


In the next post I’m going to talk about some options that already exist to implement reproducible phylogenetics.


Bergman C 2012 On the Preservation of Published Bioinformatics Code on Github. Does Casey’s blog have a name? https://caseybergman.wordpress.com/2012/11/08/on-the-preservation-of-published-bioinformatics-code-on-github/

Lunt DH 2013 How can we ensure the persistence of analysis software? EvoPhylo blog http://www.davelunt.net/evophylo/2013/03/software-persistence/

Morrison D 2013 Archiving of bioinformatics software. The Genealogical World of Phylogenetic Networks blog http://phylonetworks.blogspot.co.uk/2013/07/archiving-of-bioinformatics-software.html