Feb 11 2009

The New York Times has an article about constructing, and especially visualizing, the tree of life called “Crunching the Data for the Tree of Life”. It’s interesting, especially since I think it touches on many issues concerning tree size that even phylogenetic biologists haven’t really considered. There is lots of talk of “big” trees, sometimes only a few thousand OTUs, and of a new tree of plants containing 13,533 species[1]. Carl Zimmer over on the Loom writes that this is the biggest tree he knows of. It might be the biggest published tree I know of too, but Morgan Price on the FastTree site has a 16S rDNA tree to download containing “186,743 distinct sequences”. It’s 48MB when compressed. It will be interesting to hear of strategies to visualize a tree of this size while still maintaining associated information. The temptation, I’m sure, will be just to make it pretty but not ultimately very useful. ARB can display trees this size (I think), although I still haven’t got to grips with automated collapsing and labelling of groups.

The Smith paper looks really interesting, but I’ve only had a chance to skim it so far.

[1] Stephen A Smith, Jeremy M Beaulieu and Michael J Donoghue
Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches
BMC Evolutionary Biology 2009, 9:37 doi:10.1186/1471-2148-9-37

Oct 29 2008

I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.

I started with the annelids file. The file contains a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences from all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite, but these are bad choices for very large alignments. There are some online resources, but my fasta files are 50-250MB, so online is not the place even if I could persuade a server to accept my files. I should really have used BioPerl SimpleAlign to remove gap columns; it’s probably the most flexible option and able to deal with big files, but I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.

I ran it through Gblocks instead, which does more than just remove blank columns: it also trims areas of poor alignment, judged by various criteria. This reduced the file considerably.

I had previously installed FastTree, so I ran it with the command

fasttree -nt annelids.fasta >annelids.tree

It ran quite nicely and produced a viable tree.
Something strange with the timings though.

Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END:    2008-10-28 18:23:32
Runtime:     5886         seconds
Runtime:     01:38:06     h:m:s

The lines from “END:” onwards are the output of my perl script; the lines before that are from fasttree. So fasttree claims to have taken 1577 seconds (26 minutes), but my script times it at 1 hour 38 minutes. I actually noted the time it started, and it did take 1 hour 38 minutes. I repeated the run with identical results. A strange discrepancy.
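One way to narrow down a discrepancy like this is to record both elapsed (wall-clock) time and the CPU time used by the child process. The sketch below is a Python stand-in for the kind of timing wrapper described above, not the perl script itself. I don’t know which clock FastTree’s own “Time” line uses, so this only makes the two numbers available for comparison. The helper names are hypothetical, and the child CPU figures from `os.times()` are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def format_hms(seconds):
    """Format a number of seconds as hh:mm:ss."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


def time_command(argv, stdout_path=None):
    """Run argv; return (wall_seconds, child_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    if stdout_path is None:
        subprocess.run(argv, check=True)
    else:
        # Redirect the program's standard output to a file, like "> annelids.tree"
        with open(stdout_path, "w") as out:
            subprocess.run(argv, check=True, stdout=out)
    wall = time.perf_counter() - start
    after = os.times()
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


if __name__ == "__main__":
    # Harmless demo command; for a real run one would pass something like
    # ["fasttree", "-nt", "annelids.fasta"] with stdout_path="annelids.tree".
    wall, cpu = time_command([sys.executable, "-c", "pass"])
    print("wall-clock:", format_hms(wall), "child CPU: %.2f s" % cpu)
```

If the CPU time came out close to FastTree’s figure and the wall-clock time close to the script’s, the process was spending most of the run waiting rather than computing. That would need checking rather than assuming.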

Oct 29 2008

The prediction I made before about a long silence once this year’s students turned up was sadly accurate. Anyway, students dealt with, grant proposal submitted, lectures (mostly) given, bureaucracy reduced (a bit), time to get on with some phylogenetics.

I was playing before with FastTree. Although it looks to have been quite well tested by its developers, it’s always worth investigating real-world experiences. Last time I put together a little perl script to time the runs, but then I noticed that FastTree actually reports the number of seconds taken. Hmmm, did I just miss that all along, or did it appear in version 1.0 without my noticing? Oh well, either way I must pay more attention.

Morgan Price (the developer of FastTree) released a version a few days ago that should compile without the malloc error I talked about before.

Sep 26 2008

This is how I downloaded, compiled and got FastTree working. It’s a bit obvious in places, but I think detailed instructions are a good thing to have out there and Google findable. I am using a multicore MacPro 2.8GHz with 4GB RAM and OSX 10.5.4 (I’m not sure the 8 cores make any difference whatsoever if the code isn’t written to take account of them).

  • I downloaded FastTree from www.microbesonline.org/fasttree.
  • I had to use Safari to do this as Firefox wouldn’t let me right click and download.
  • FastTree 1.0.0 is available as binaries for Windows and Linux. Unfortunately there are no Mac binaries.
  • I downloaded the C code file (156kb)
  • I installed the developer tools “XcodeTools.mpkg” from system software install DVD number 2. It took about 15 minutes. This allowed me to use the gcc compiler to actually make the application.
  • I opened terminal, moved to the location of the c file and issued the command: gcc -lm -O2 -Wall -o FastTree FastTree.c
  • It didn’t like that much and gave the error: FastTree.c:212:20: error: malloc.h: No such file or directory

After some Google searching I came across a couple of indications that malloc.h (I had no idea what this was) was outdated.

malloc.h not supported, use stdlib.h (http://developer.apple.com/technotes/tn2002/tn2071.html)

Most helpful was this:

Mac OS X for unix geeks
5.1.2. malloc.h

make may fail in compiling some types of Unix software if it cannot find malloc.h. Software designed for older Unix systems may expect to find this header file in /usr/include; however, malloc.h is not present in this directory. The set of malloc( ) function prototypes is actually found in stdlib.h. For portability, your programs should include stdlib.h instead of malloc.h. (This is the norm; systems that require you to use malloc.h are the rare exception these days.) GNU autoconf will detect systems that require malloc.h and define the HAVE_MALLOC_H macro. If you do not use GNU autoconf, you will need to detect this case on your own and set the macro accordingly. You can handle such cases with this code:


  • So I opened the C file in a text editor and searched for malloc.h, which I found on line 212. I then deleted that line (#include <malloc.h>) and inserted the four lines from above.
  • I repeated the gcc command to build the application from above and it worked. No errors and it produced the application in 1 second.
  • I tried to launch it and display the help file using the terminal command: ./FastTree -h
  • It didn’t work at all, so for no logical reason I just dumped the application and rebuilt it with the same gcc command. This time ./FastTree -h did launch the application and it displayed the help file. Success!
  • Using the simple instructions from the FastTree page I tried to run it on some lizard DNA sequences I had lying around. These were fasta sequences, although it claims to work on phylip also. The command was: ./FastTree -nt lizards.fasta > lizards.tre
  • It produced a tree with only the first taxon name and nothing else. The input file had mac line endings though and when I corrected that to unix it was fine.
  • I noticed that some of the names were truncated and wondered if they had been chopped at spaces. I replaced with underscores and got better results. [In the spirit of full disclosure these last two points were with the previous version of FastTree and I didn't try to replicate these errors with version 1.0.0]
  • Another problem I had with the previous version was when I accidentally had a repeated taxon in the matrix. It complained about a “non unique name X in the alignment” and wrote an empty treefile.
  • Having got around these teething problems it ran perfectly, almost instantly writing a treefile (input fasta N=112, 923bp). The tree looked quite sensible and it had support values (local bootstraps). These are on a 0 to 1 scale, so 0.95 is 95%.
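The three input problems in the list above (old Mac line endings, spaces in names, repeated taxon names) can all be caught before running FastTree. Here is a rough Python sketch of such a pre-flight check. The function name and the lizard names in the demo are invented for illustration, and it takes the whole file as one string.

```python
def clean_fasta(text):
    """Normalise a FASTA string for tree-building programs.

    - converts Windows (CRLF) and old Mac (CR) line endings to Unix (LF)
    - replaces runs of whitespace in names with underscores
    - reports names that occur more than once

    Returns (cleaned_text, list_of_duplicate_names).
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    seen, duplicates, out = set(), [], []
    for line in text.split("\n"):
        if line.startswith(">"):
            name = "_".join(line[1:].split())
            if name in seen:
                duplicates.append(name)
            seen.add(name)
            out.append(">" + name)
        elif line.strip():
            out.append(line.strip())
    return "\n".join(out) + "\n", duplicates


if __name__ == "__main__":
    # Toy input with Mac line endings, spaces in names and a repeated name
    raw = ">Anolis carolinensis\rACGT\r>Anolis carolinensis\rACGA\r"
    cleaned, dupes = clean_fasta(raw)
    print(cleaned, end="")
    if dupes:
        print("repeated names:", ", ".join(dupes))
```

Duplicates are only reported, not renamed, because deciding which copy to keep is a judgement about the data rather than about the file format.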

Fairly straightforward on the whole (except for the malloc error). Next I’m going to report on some bigger runs and start timing them properly. My immediate goal is to get a tree of all ~20k metazoan 18S rDNA sequences. Of course a tree of that size will bring its own problems, such as how to visualize it.

Sep 26 2008

So when I started writing this blog I thought I would use it to outline some of the things I was working on as I went along. Not real projects, which I will write up and publish, but side projects and how I got them to work (or otherwise). Unfortunately there hasn’t been much of that, lots of reviews and comment instead. Hopefully I am going to change that now. 

Last April I posted about FastTree, the rapid NJ application that seems to be able to handle approx 40,000 OTUs. The authors say 
“FastTree inferred a phylogeny for an alignment of 39,092 proteins, including support values, in half an hour on a desktop PC”.

I actually compiled and started using it back in April, but got swamped by teaching and theses and stuff and never posted anything more. My next few posts are going to be about getting FastTree to work and how it copes with some of my datasets. Alternatively, as the new crop of undergraduates arrives next week, there may be a long silence instead.

Jul 31 2008

I came across a nice program by Heroen Verbruggen called TreeGradients.

“TreeGradients is a tree drawing program. The tree drawing options are fairly basic but the program has the ability to plot several types of continuous variables at the nodes in colors and use linear color gradients to fill the branches between nodes. The output format is SVG (scalable vector graphics), which can be imported in most vectorial drawing software.”

It looks like Heroen is particularly interested in plotting continuous variables across trees. The part that immediately interested me was the ability to colour internal nodes by bootstrap (or Bayesian) support. In the example on the website poor support is given by pale greys along a gradient to strong support as black. When dealing with very large trees this is a nice visual trick to focus the mind on areas that are well supported and away from poorly supported areas (by making these less visible). Colours and presence of numeric bootstrap values can be adjusted to taste. The program is actually a pair of perl scripts distributed under an open-source GNU General Public Licence. I want to congratulate the author for making these open-source.
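The grey-gradient idea is easy to sketch. The Python below is not TreeGradients’ code (that is a pair of perl scripts, and I haven’t looked at how they do it). It just maps a support value on a 0 to 1 scale linearly onto a grey between a pale shade and black, giving a colour string that could be used in an SVG. The choice of `#dddddd` as the palest grey is an arbitrary assumption.

```python
def support_to_grey(support, palest=0xDD):
    """Map a support value in [0, 1] to a hex grey: pale for weak, black for strong."""
    s = min(1.0, max(0.0, float(support)))  # clamp out-of-range values
    level = int(round(palest * (1.0 - s)))  # 0 = black, palest = weakest support
    return "#{0:02x}{0:02x}{0:02x}".format(level)


if __name__ == "__main__":
    for value in (0.0, 0.5, 0.95, 1.0):
        print(value, support_to_grey(value))
```

Bootstrap percentages would need dividing by 100 first.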

I haven’t actually tried it out yet but thought I’d flag it up now rather than my usual habit of waiting and waiting until I could review it properly (and my backlog is running at about 6 months now).

Apr 16 2008

There is a new release (94) of the SILVA database of ribosomal DNA sequences. 23,133 Metazoans, 88,997 Eukaryotes and 606,879 SSU sequences in total.
I’m having a few problems installing ARB on a new machine but need to start exploring this phenomenal phylogenetic resource more closely. One thing I would love when browsing large trees purely for fun is to see pictures at the tips. I wonder if you could automatically grab taxon images, perhaps from Yahoo, maybe in a similar way to Rod Page’s iSpecies? There are a few problems with this, I guess, but it ought to be quite achievable. It would certainly make the trees much more fun for non-taxonomists. The “animal tree of life” paper in Nature that I mentioned in a previous post provokes many more positive comments from my colleagues when they see the tree than I have ever seen with other phylogenies. I suspect it is because the authors include nice drawings at the tips. This simple thing might turn out to be very important for accessibility and dissemination of systematic info.

Apr 13 2008

There was a message on the excellent EvolDir mailing list a few days ago about FastTree. This is a very fast neighbor-joining program for very large scale phylogenetic analyses. It uses profiles rather than a distance matrix and includes local support values instead of bootstraps. The examples in the preprint manuscript cover datasets of 10,610 to 167,547 aligned sequences. The 167k sequence alignment has 7,682 columns and the tree took 95 hours. The 10k dataset took 3 minutes! The preprint manuscript, downloadable from the website below, is well written and informative. I’m looking forward to testing this, hopefully in the next few days. I’ll post with my experiences. Below is the announcement.
We are pleased to announce the initial release of FastTree, a tool for inferring neighbor joining trees from large alignments. FastTree is capable of computing trees for tens to hundreds of thousands of protein or nucleotide sequences on most desktop computers.

FastTree uses:
*profiles instead of a distance matrix to reduce memory usage
*linear distances with a character dissimilarity matrix
*a new “top hit” heuristic to achieve a sub N-squared run time
*local support instead of bootstrap for node support values

FastTree is faster than computing a distance matrix, and up to 10,000 times faster than neighbor joining with bootstrap. FastTree is about as accurate as BIONJ with log corrected distances for well-supported nodes.
To download source code, binaries, or a preprint, please visit:

Background: A fundamental goal of molecular evolution is to infer the evolutionary history (the phylogeny) of sequences from their alignment. Neighbor joining, which is a standard method for inferring large phylogenies, takes as its input the distances between all pairs of sequences. The distance matrix requires O(N^2 L) time to compute and O(N^2) memory to store, where N is the number of sequences and L is the width of the alignment. As some families already contain over 100,000 sequences, these time and space requirements are prohibitive.
Results: We show that neighbor-joining can be implemented in O(NLa) space, where ‘a’ is the size of the alphabet, by storing profiles of internal nodes in the tree instead of storing a distance matrix. Profile based neighbor joining allows weighted joins, as in BIONJ, but requires that distances be linear. With heuristic search, neighbor joining with profiles takes only O(N*SQRT(N) log(N)La) time. We estimate the confidence of each split (A,B) vs. (C,D) from the profiles of A, B, C, and D, without bootstrapping. Our implementation, FastTree, has similar accuracy as traditional neighbor joining. FastTree constructed trees, including support values, for biological alignments with 39,092 or 167,547 distinct sequences in less time than it takes to compute the distance matrix and in a fraction of the space. Traditional neighbor joining with 100 bootstraps would be 10,000 times slower.
Conclusions: Neighbor joining with profiles makes it possible to construct phylogenies for the largest sequence families and to estimate their reliability.

Morgan N. Price & Paramvir S. Dehal
Virtual Institute for Microbial Stress and Survival
Arkin Lab
Physical Biosciences Division
Lawrence Berkeley National Lab
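To get a feel for why the O(N^2) distance matrix in the abstract is called prohibitive, here is a back-of-envelope calculation in Python. It assumes each distance is stored as a 4-byte float, which is my assumption and not something the announcement states, and it says nothing about how FastTree itself stores its profiles.

```python
def distance_matrix_bytes(n_sequences, bytes_per_value=4, triangular=False):
    """Bytes needed to store all pairwise distances between n_sequences.

    bytes_per_value=4 assumes single-precision floats (an assumption).
    triangular=True stores each unordered pair once, without the diagonal.
    """
    if triangular:
        n_values = n_sequences * (n_sequences - 1) // 2
    else:
        n_values = n_sequences * n_sequences
    return n_values * bytes_per_value


if __name__ == "__main__":
    # Sequence counts quoted in the announcement and preprint summary above
    for n in (10610, 39092, 167547):
        full = distance_matrix_bytes(n) / 1e9
        half = distance_matrix_bytes(n, triangular=True) / 1e9
        print("N = %7d: %8.2f GB full matrix, %8.2f GB triangular" % (n, full, half))
```

At 100,000 sequences the full matrix comes to 4 × 10^10 bytes, or about 40 GB, which is ten times the 4GB of RAM in a desktop like mine.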

Mar 13 2008

Following on from my previous post, I decided to try Google Maps as an interface to large phylogenetic trees. This was a very quick and dirty go at seeing whether it would work as a navigable interface. I tried the implementation at MapLib, which allows you to upload your own images and use Google Maps to explore them. So I uploaded some PNG images generated from big ARB trees. It worked quite well. Unfortunately there are some restrictions on the size of image that can be uploaded to this site, so a thorough test of zooming about huge trees wasn’t really possible. But the image here is a screenshot of a smallish tree.

So, I realize that this doesn’t meet many of my own suggestions for getting information from large trees but it does have some interesting possibilities as a simple browser with a good user interface.

Feb 26 2008

ARB is a database program for sequence data, alignments and trees. It is primarily used by the microbial rDNA community, although it is equally powerful for other genes and taxonomic groups. ARB is my primary productivity software for phylogenetics and I thought I would introduce it briefly.

The ARB software is a graphically oriented package comprising various tools for sequence database handling and data analysis. A central database of processed (aligned) sequences and any type of additional data linked to the respective sequence entries is structured according to phylogeny or other user defined criteria. [http://www.arb-home.de/]
Although it has some irritations, and took me a little effort to install and learn, ARB is the most powerful phylogenetic environment currently available. Yes, there are some great phylogenetic inference programs (I like RAxML, PhyML) but that isn’t the same thing at all. This is an environment for understanding sequence data and associated information in a phylogenetic context, not just inferring a good tree. My workflow runs something like this:
  • Search GenBank for sequences from taxonomic group of interest.
  • Import entire GenBank records into ARB (add my own sequences)
  • In ARB, align using Clustal and build quick NJ tree
  • Use ARB to add group names (e.g. “Nematoda”) from GenBank/EBI taxonomy or use my own group names (e.g. “very small worms”)
  • Check alignment belonging to “weird looking” phylogenetic groups and branch lengths, edit where necessary
  • Export alignment (e.g. as a PHYLIP file, which RAxML reads) and build a good ML tree in RAxML
  • Re-import ML tree to ARB and transfer group names from previous annotated tree
  • Ponder
I find it incredibly useful to have the full GenBank record available by clicking on the tips of the tree, and user-defined tip names (such as “Genus species isolate_source accession_number”) drawing on info from the full record. ARB deals very well with tens of thousands of sequences. The alignment editor groups together sequences that have been grouped in the tree. Because these have an editable consensus sequence that propagates to all contained sequences it is feasible to quickly check and edit thousands of sequences.

Can you imagine having 10,000 sequences in an alignment editor? How long would it take you to check the alignment and make minor corrections? What about the tree of those 10,000 sequences? Are the names sensible? Can you include the accession numbers, or contract the genus name to a single letter without rebuilding the tree? How long does it take to scroll through your tree? What taxonomic info are you actually seeing when you scroll, is it just a blur of names, are you just relying on your memory of what species you are looking at or is there node labelling and group collapsing to help?
Although ARB is far from perfect it is powerful and well designed. I don’t really see alternatives out there for dealing with lots of sequences and keeping all the data about those OTUs accessible. It’s what I’m going to be using to explore building (and understanding) trees with lots of tips.

I’m actually using a very old version at the moment. There was a new version released in December 07, but I’m waiting for my new machine to arrive before I install it. I’m looking forward to checking it out.

Ludwig et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Research. 32(4):1363-1371. doi:10.1093/nar/gkh293