Aug 27 2012
 

I’ve been reading a lot recently about reproducible research (RR) in bioinformatics on several blogs, and Google+ and Twitter. The idea is that it is important that someone is easily able to reproduce* your results (and even figures) from your publication using your provided code and data. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research- Reproducible Research in Phylogenetics or RRphylo.

The current state of affairs

The problem is that although the Methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, doing so would be a very time-consuming process: lots of hands-on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looks like the one published, and that’s not really good enough.

Why it matters

I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described on his blog. I have had experience of this too, but that’s another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important, you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or the alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork, this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.

What should be done?

Briefly: pipelines, workflows and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, BioPython), or DendroPy, or Hal. This would go from sequence download, manipulation and alignment (though this part could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step, and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than starting from an alignment file.
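As a sketch of what ‘every parameter explicit’ could look like, here is a hypothetical three-step workflow in Python. The tool names, flags and filenames are illustrative inventions, not a real analysis; the point is that nothing is left to a silent default:

```python
import shlex

# Hypothetical three-step workflow: tool names, flags and filenames are
# illustrative only. Every parameter, including values the tools would
# otherwise default silently, is written out explicitly.
STEPS = [
    ("align",  ["muscle", "-in", "seqs.fasta", "-out", "aln.fasta",
                "-maxiters", "16", "-seqtype", "dna"]),
    ("infer",  ["raxmlHPC", "-s", "aln.fasta", "-n", "run1",
                "-m", "GTRGAMMA", "-p", "12345"]),
    ("render", ["nw_display", "run1.best.tre"]),
]

def workflow_commands(steps):
    """Return the exact, shell-safe command line for each step, in order."""
    return [(name, " ".join(shlex.quote(arg) for arg in argv))
            for name, argv in steps]

for name, cmd in workflow_commands(STEPS):
    print(f"{name}: {cmd}")
```

A real version would run each command (e.g. via subprocess) and archive this list alongside the results; the exact command lines, not the prose description, become the reproducible record.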

Phylogeneticists are sometimes much less comfortable with scripting than bioinformaticians are, but this has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).

There is a certain attraction to GUI approaches, though, and something that may well increase is GUI workflow builders. The figure above is from Armadillo, which looks interesting but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.

At the moment a script-based pipeline linking analyses together is the best approach for RRphylo.

‘So you’ve been doing this, right?’

Err, no, I’ve been bad, really I have. Some of my publications are very poor from the viewpoint of RRphylo. The reasons for that would take too long to go into, and I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need for RRphylo and the challenges in implementing it. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of an analysis is how good phylogenetics is done, and that is best implemented in an automated way.

Automated Methods sections for manuscripts?

One interesting idea, both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from GenBank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe all this itself, in a human-readable text format suitable for your manuscript.

The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……

This is a brief, made-up Methods section that doesn’t contain all the detail it could. Parameter values from the script have been inserted in blue. This sort of output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB it is not this Methods text that needs to be distributed; the script that carried out these analyses (and produced this human-readable summary as an extra output) is the real record to distribute.
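A minimal sketch of how a pipeline might emit that draft; the record fields and wording here mirror the made-up example above, not any real tool’s API:

```python
# Hypothetical parameter record; a real pipeline would fill this in as
# each step actually runs, so the text can never drift from the analysis.
record = {
    "date": "27 August 2012",
    "db": "NCBI nucleotide",
    "query": "cichlidae[ORGN] AND d-loop[TITL] AND 300:1600[SLEN]",
    "n_seqs": 5611,
    "aligner": "MUSCLE",
    "aligner_version": "v1.2.3",
}

def methods_paragraph(r):
    """Render the parameter record as a human-readable Methods draft."""
    return (
        "Sequences were downloaded from the {db} database ({date}) with "
        "the Entrez query {query}. These {n_seqs:,} sequences were "
        "aligned with {aligner} {aligner_version}."
    ).format(**r)

print(methods_paragraph(record))
```

Because the paragraph is rendered from the same record the script executes from, the prose and the analysis cannot disagree.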

Implementation

This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old Methods section they always have: a general description of what they did that people in the field will understand. I went today and read a lot of Methods sections from phylogeny papers. Some were better than others in describing the important details, but none sounded relevant to the new era of large-scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.

It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.

The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing easy GUI linking of phylogenetic modules and publication of standard-format workflows to myExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user guides allowing even moderately computer-literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other workflow tools worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing, so I’d be really interested in other suggestions and views on these options.

I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible- of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?

See also 

Our approach to replication in computational science, C Titus Brown

Reproducible Research: A Bioinformatics Case Study, Robert Gentleman, 2004

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility, Nekrutenko & Taylor, Nature Reviews Genetics 13, 667-672 (September 2012), doi:10.1038/nrg3305 (subscription required)

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Goecks et al., Genome Biology 2010, 11:R86, doi:10.1186/gb-2010-11-8-r86 (open access)

Reproducible research with bein, BiocodersHub

Sumatra also talks about pipelines acting as an Electronic Lab Book, here’s a presentation about it.

* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.

Oct 11 2010
 

I just received promotional information about a new book from Garland Science publishers: “Genome Duplication: Concepts, Mechanisms, Evolution and Disease” by Melvin L. DePamphilis and Stephen D. Bell, Garland Science, Oct 2010, ISBN: 978-0-415-44206. It sounds like a great title, especially for someone like me who thinks genome and gene duplication are among the most important processes in the whole of evolutionary biology.

Unfortunately, on the cover of the book it looks like they have drawn a tree of some model organisms and placed Drosophila melanogaster in a monophyletic group with Arabidopsis thaliana, to the exclusion of all the other animals. Ooops.

You would have thought that a big publisher would have checked more carefully before creating a book cover image that is so obviously wrong, but I guess we all have bad days. It will be interesting to see what happens next.
Oct 29 2008
 

I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.

I started with the annelids file. The file contains a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences from all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite, but these are bad choices for very large alignments. There are some online resources, but my fasta files are >50-250MB, so online is not the place, even if I could persuade a server to accept my uploads. I should really have used BioPerl’s SimpleAlign to remove gap columns; it’s probably the most flexible option and able to deal with big files, but I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.
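For what it’s worth, stripping all-gap columns doesn’t need a heavyweight package. A Python sketch (toy parser, everything held in memory; a multi-hundred-MB SILVA file would want a two-pass streaming version of the same idea):

```python
def read_fasta(text):
    """Minimal FASTA parser: returns a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def drop_gap_columns(records, gap_chars="-."):
    """Remove alignment columns that are gaps in every sequence.
    Both '-' and '.' count as gaps, since aligners vary in which they use."""
    seqs = [s for _, s in records]
    keep = [i for i in range(len(seqs[0]))
            if any(s[i] not in gap_chars for s in seqs)]
    return [(n, "".join(s[i] for i in keep)) for n, s in records]

aln = read_fasta(">a\nA--C\n>b\nA--G\n")
print(drop_gap_columns(aln))  # the two middle columns are all gaps
```

This only removes entirely blank columns; trimming poorly aligned regions, as Gblocks does, is a separate and more opinionated step.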

I ran it through Gblocks instead, which does more than just remove blank columns, also trimming areas of poor alignment judged by various criteria. This reduced the file considerably.

I had previously installed FastTree, so I ran it with the command

fasttree -nt annelids.fasta >annelids.tree

It ran quite nicely and produced a viable tree.
There was something strange with the timings, though.

Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END:    2008-10-28 18:23:32
----------------------------------------------
Runtime:     5886         seconds
Runtime:     01:38:06     h:m:s
----------------------------------------------

The text starting with “END:” is the output of my perl script; the lines before that are from FastTree. So FastTree claims to have taken 1577 seconds (26 minutes), but my script times the run at 1 hour 38 minutes. I actually noted the time it started, and it did take 1 hour 38 mins. I repeated the run with identical results. A strange discrepancy.
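I can’t diagnose this from the numbers alone, but a common source of such mismatches is comparing a tool’s internal timer (which may cover only one phase of the run, or CPU time) with a wrapper script’s wall-clock measurement of the whole run. A minimal Python illustration of how the two diverge (the sleep stands in for phases the internal timer doesn’t count):

```python
import time

wall_start = time.perf_counter()   # wall-clock: what an external script sees
cpu_start = time.process_time()    # CPU time consumed by this process only

time.sleep(0.2)                    # stand-in for I/O or uncounted phases
_ = sum(i * i for i in range(100_000))  # stand-in for the timed computation

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall-clock {wall:.2f}s vs cpu {cpu:.2f}s")
```

If a tool reports something closer to CPU time for one phase while the wrapper measures wall-clock for everything, a 26-minute vs 98-minute gap of the kind above is quite possible.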

Oct 29 2008
 

The prediction I made before about a long silence once this year’s students turned up was sadly accurate. Anyway, students dealt with, grant proposal submitted, lectures (mostly) given, bureaucracy reduced (a bit), time to get on with some phylogenetics.

I was playing before with FastTree. Although it looks to have been quite well tested by its developers, it’s always worth investigating real-world experiences. Last time I put together a little perl script to time the runs, but then I noticed that it actually reports the number of seconds taken. Hmmm, did I just miss that all along, or did it appear in version 1.0 without my noticing? Oh well, either way I must pay more attention.

Morgan Price (the developer of FastTree) released a version a few days ago that should compile without the malloc error I talked about before.

Sep 26 2008
 

This is how I downloaded, compiled and got FastTree working. It’s a bit obvious in places, but I think detailed instructions are a good thing to have out there and Google-findable. I am using a multicore MacPro 2.8GHz with 4GB RAM and OSX 10.5.4 (I’m not sure the 8 cores make any difference whatsoever if the code isn’t written to take advantage of them).

  • I downloaded FastTree from www.microbesonline.org/fasttree.
  • I had to use Safari to do this as Firefox wouldn’t let me right click and download.
  • FastTree 1.0.0 is available as binaries for Windows and Linux. Unfortunately there are no Mac binaries.
  • I downloaded the C code file (156kb)
  • I installed the developer tools “XcodeTools.mpkg” from system software install DVD number 2. It took about 15 minutes. This allowed me to use the gcc compiler to actually make the application.
  • I opened terminal, moved to the location of the c file and issued the command: gcc -lm -O2 -Wall -o FastTree FastTree.c
  • It didn’t like that much and gave the error: FastTree.c:212:20: error: malloc.h: No such file or directory

After some Google searching I came across a couple of indications that malloc.h (I had no idea what this was) was outdated.

malloc.h not supported, use stdlib.h (http://developer.apple.com/technotes/tn2002/tn2071.html)

Most helpful was this:

Mac OS X for unix geeks
http://unix.compufutura.com/mac/ch05_01.htm
5.1.2. malloc.h

make may fail in compiling some types of Unix software if it cannot find malloc.h. Software designed for older Unix systems may expect to find this header file in /usr/include; however, malloc.h is not present in this directory. The set of malloc( ) function prototypes is actually found in stdlib.h. For portability, your programs should include stdlib.h instead of malloc.h. (This is the norm; systems that require you to use malloc.h are the rare exception these days.) GNU autoconf will detect systems that require malloc.h and define the HAVE_MALLOC_H macro. If you do not use GNU autoconf, you will need to detect this case on your own and set the macro accordingly. You can handle such cases with this code:

#include <stdlib.h>
#ifdef HAVE_MALLOC_H
#include <malloc.h>
#endif

  • So I opened the C file in a text editor and searched for malloc.h; I found it on line 212. I then deleted that line (#include <malloc.h>) and inserted the four lines from above.
  • I repeated the gcc command to build the application from above and it worked. No errors and it produced the application in 1 second.
  • I tried to launch it and display the help file using the terminal command: ./FastTree -h
  • It didn’t work at all, so for no logical reason I just dumped the application and rebuilt it with the same gcc command. This time ./FastTree -h did launch the application and it displayed the help file. Success!
  • Using the simple instructions from the FastTree page I tried to run it on some lizard DNA sequences I had lying around. These were fasta sequences, although it claims to work on phylip also. The command was: ./FastTree -nt lizards.fasta > lizards.tre
  • It produced a tree with only the first taxon name and nothing else. The input file had Mac line endings, though, and when I corrected them to Unix it was fine.
  • I noticed that some of the names were truncated and wondered if they had been chopped at spaces. I replaced the spaces with underscores and got better results. [In the spirit of full disclosure, these last two points were with the previous version of FastTree and I didn't try to replicate these errors with version 1.0.0.]
  • Another problem I had with the previous version was when I accidentally had a repeated taxon in the matrix: it complained about a “non unique name X in the alignment” and wrote an empty treefile.
  • Having got around these teething problems, it ran perfectly, almost instantly writing a treefile (input fasta N=112, 923bp). The tree looked quite sensible and had support values (local bootstraps). These are on a 0 to 1 scale, so 0.95 is 95%.
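The two input-file problems above (Mac line endings, spaces in names) are easy to fix in an automated, repeatable way rather than by hand; a small hypothetical helper:

```python
def sanitize_fasta(text):
    """Normalize Mac/Windows line endings to Unix, and replace spaces in
    FASTA record names with underscores (FastTree appeared to truncate
    names at spaces)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for line in text.split("\n"):
        if line.startswith(">"):
            line = ">" + line[1:].strip().replace(" ", "_")
        out.append(line)
    return "\n".join(out)

print(sanitize_fasta(">Anolis carolinensis 12S\rACGT\r"))
```

Run once over the input file before tree building, so the fix is part of the recorded workflow rather than a manual edit.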

Fairly straightforward (except for the malloc error) on the whole. Next I’m going to report on some bigger runs and start timing them properly. My immediate goal is to get a tree of all ~20k metazoan 18S rDNA sequences. Of course a tree of that size will bring its own problems, such as how to visualize it.

Sep 26 2008
 

So when I started writing this blog I thought I would use it to outline some of the things I was working on as I went along. Not real projects, which I will write up and publish, but side projects and how I got them to work (or otherwise). Unfortunately there hasn’t been much of that, lots of reviews and comment instead. Hopefully I am going to change that now. 

Last April I posted about FastTree, the rapid NJ application that seems to be able to handle approx 40,000 OTUs. The authors say 
“FastTree inferred a phylogeny for an alignment of 39,092 proteins, including support values, in half an hour on a desktop PC”.

I actually compiled and started using it back in April, but got swamped by teaching and theses and stuff and never posted anything more. My next few posts are going to be about getting FastTree to work and how it copes with some of my datasets. Alternatively as the new crop of undergraduates arrive next week there may be a long silence instead.
Sep 02 2008
 

I just scanned through a very interesting article in BMC Bioinformatics discussing the results of a data mining approach to describing phylogenetic methodology in published articles.

Eales et al. Methodology capture: discriminating between the “best” and the rest of community practice. BMC Bioinformatics 2008, 9:359 doi:10.1186/1471-2105-9-359

They searched for “phylogen*” in titles and abstracts at PubMed, downloaded PDFs and converted to text. This was successful for 21,484 articles. They analysed those published after 1996 (about 90% of full dataset) employing data mining techniques to extract the protocol employed for phylogenetic reconstruction. 723 journals were represented by the 17,732 protocols successfully extracted.

“We found that 17% (3,712 articles) of articles were published in evolutionary biology journals, 22% (4,625 articles) were published in microbiology or bacteriology journals and 11% (2,274 articles) were from journals related to virology. The remaining 50% (10,873 articles) were published in a wide variety of fields.”

This is staggering, at least to me. In 12 years over 21 thousand articles have prominently talked about phylogenetics, and this is very broadly distributed, being found in over 700 different scientific journals. Remember these figures for next time a colleague is dismissive of “tree building”! It is clear that phylogenetic approaches are truly pervasive in modern biology.

The authors found differences in the methods used between the three groups mentioned above (evolutionary biology, microbiology and virology). One difference is in the use of Bayesian analysis:

“Over 60% of evolutionary biology articles published in 2005 included one or more references to a term describing Bayesian phylogenetic analysis of some kind, this compares to 5% of microbiology and 11% of virology articles.”

This is not specifically about Bayesian analysis per se. It is highlighted by the authors as an example that evolutionary biology tends to adopt new analysis techniques faster than microbiology or virology. Something that interested me was the comment:

“Almost all of the 10 protocols used most commonly by the phylogenetics community represent a valid choice (except those using UPGMA [see 31, 32]) for a researcher new to the field.”

Obviously “valid choice” could be seen as a little subjective, although it is almost certainly correct; the methods may sometimes be old-fashioned, but they have been peer reviewed and are OK. What attracted my attention, though, was UPGMA: are there really people still using UPGMA? Then I started to think about all the genetics and bioinformatics textbooks I’ve seen that have a section on phylogenetic analysis. UPGMA is almost ubiquitous, and little analysis is discussed beyond the venerable PHYLIP.

An interesting paper.
—-
References from the quote above concerning UPGMA
31. Leitner T, Escanilla D, Franzen C, Uhlen M, Albert J: Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proc Natl Acad Sci USA 1996.
32. Huelsenbeck JP: Performance of Phylogenetic Methods in Simulation. Systematic Biology 1995, 44(1):17-48.

Aug 22 2008
 

There has been discussion and research for decades into support values for phylogenetic nodes and the relative quality of different phylogenies as a whole. Here is a new and impressive (although clearly subjective) criterion for confidence in a phylogenetic tree: “I am so confident in this tree I have had it tattooed onto my body”!
The Loom has been posting pictures in the Science Tattoo Emporium for a while now and I noticed that there are 4 phylogeny tattoos. So ask yourself, how confident are you in that tree you just built? Confident enough to show it to a colleague? Confident enough to publish it? Or confident enough to live with it forever?

But what if further analysis, or data collection, shows it to be wrong? This happens frequently, even with apparently very good quality trees. Do you then say to your grandchildren “yes, this tattoo is how we saw eukaryotic phylogeny in 2004. It’s wrong of course; pass me that pen and I’ll fix it to show you where the basal amoebae really fall”?

Some trees, of course, don’t really have to be right, such as Darwin’s first tree diagram or Haeckel’s Tree of Life; they are just beautiful and important.
Others are more specific, such as the one pictured at the top of the blog, depicting the “5 kingdoms”, or this one on the evolutionary history of HIV. Truly impressive, but after some thought, maybe not for me.

Apr 13 2008
 

There was a message on the excellent EvolDir mailing list a few days ago about FastTree. This is a very fast neighbor-joining program for very large-scale phylogenetic analyses. It uses profiles rather than a distance matrix, and includes local support values instead of bootstraps. The examples in the preprint manuscript describe datasets of between 10,610 and 167,547 aligned sequences. The 167k-sequence alignment has 7,682 columns and the tree took 95 hours; the 10k dataset took 3 minutes! The preprint manuscript, downloadable from the website below, is well written and informative. I’m looking forward to testing this, hopefully in the next few days. I’ll post with my experiences. Below is the announcement.
———————————————————————
We are pleased to announce the initial release of FastTree, a tool for inferring neighbor joining trees from large alignments. FastTree is capable of computing trees for tens to hundreds of thousands of protein or nucleotide sequences on most desktop computers.

FastTree uses:
*profiles instead of a distance matrix to reduce memory usage
*linear distances with a character dissimilarity matrix
*a new “top hit” heuristic to achieve a sub N-squared run time
*local support instead of bootstrap for node support values

FastTree is faster than computing a distance matrix, and up to 10,000 times faster than neighbor joining with bootstrap. FastTree is about as accurate as BIONJ with log corrected distances for well-supported nodes.
To download source code, binaries, or a preprint, please visit:
http://www.microbesonline.org/fasttree/

Abstract
Background: A fundamental goal of molecular evolution is to infer the evolutionary history, the phylogeny, of sequences from their alignment. Neighbor joining, which is a standard method for inferring large phylogenies, takes as its input the distances between all pairs of sequences. The distance matrix requires O(N^2 L) time to compute and O(N^2) memory to store, where N is the number of sequences and L is the width of the alignment. As some families already contain over 100,000 sequences, these time and space requirements are prohibitive.
Results: We show that neighbor-joining can be implemented in O(NLa) space, where ‘a’ is the size of the alphabet, by storing profiles of internal nodes in the tree instead of storing a distance matrix. Profile based neighbor joining allows weighted joins, as in BIONJ, but requires that distances be linear. With heuristic search, neighbor joining with profiles takes only O(N*SQRT(N) log(N)La) time. We estimate the confidence of each split (A,B) vs. (C,D) from the profiles of A, B, C, and D, without bootstrapping. Our implementation, FastTree, has similar accuracy as traditional neighbor joining. FastTree constructed trees, including support values, for biological alignments with 39,092 or 167,547 distinct sequences in less time than it takes to compute the distance matrix and in a fraction of the space. Traditional neighbor joining with 100 bootstraps would be 10,000 times slower.
Conclusions: Neighbor joining with profiles makes it possible to construct phylogenies for the largest sequence families and to estimate their reliability.

Morgan N. Price & Paramvir S. Dehal
fasttree@microbesonline.org
Virtual Institute for Microbial Stress and Survival
Arkin Lab
Physical Biosciences Division
Lawrence Berkeley National Lab
psdehal@gmail.com
———————————————————————

Mar 09 2008
 

While I was reading the Nature paper I was talking about in my last post, I was thinking about the use of the term “phylogenomics”. It seems like there are two quite separate contexts where it is used:
(1) integrating evolutionary biology into genomics; (2) phylogenetics using a lot of data.

The term “phylogenomics” was first published in 1998 by Jonathan Eisen. It is clear he was talking about option 1, not 2. In a recent blog post he says

“‘phylogenomic’ analysis in the way I think of phylogenomics — that is — a integration of evolutionary and genomic analyses. (NOTE – I think it is kind of lame that people use the term phylogenomics, which I coined by the way, to refer to “using genomes to infer evolutionary trees).”

So that’s clear then? Well, unfortunately not. There are so many publications now using it in the second sense that, irrespective of the initial meaning, it looks like the term now has both. The Wikipedia (stub) article also includes both meanings.

Web of Science reveals 92 publications in 2007 with the term phylogenomic*. I didn’t go through them and work out the split but both types of phylogenomics are there in numbers. WoS lists 272 publications with “phylogenomic*” across all years (searched 9th March 2008).

In my short description above I used “lots of data” rather than whole genomes. There are relatively few type 2 papers that actually use whole genomes. That’s understandable, I guess, especially for Metazoa. The Nature paper I was talking about has a data matrix of 150 genes, although the mean data representation per taxon is only ~50%. That’s still a huge amount of data, but it’s not whole-genome analysis. I wonder what the smallest dataset is that self-applies the term phylogenomics?

I am not exactly clear on the difference between type 2 phylogenomics and supermatrix approaches to phylogenetics. How big does a matrix have to be before it’s super? Ultimately, the use of the same name for different things may not matter much. I see little evidence that people are very confused.

Here are some papers of interest-

Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis.
Eisen JA
Genome Res. 1998 Mar;8(3):163-7

Phylogenomics: Intersection of Evolution and Genomics
Jonathan A. Eisen and Claire M. Fraser
Science 13 June 2003 Vol. 300. no. 5626, pp. 1706 – 1707
DOI: 10.1126/science.1086292

Phylogenomics and the reconstruction of the tree of life
Delsuc F, Brinkmann H, Philippe H
NATURE REVIEWS GENETICS Volume: 6 Issue: 5 Pages: 361-375 Published: MAY 2005

The supermatrix approach to systematics
de Queiroz A, Gatesy J
TRENDS IN ECOLOGY & EVOLUTION Volume: 22 Issue: 1 Pages: 34-41 Published: JAN 2007