Feb 11 2009

The New York Times has an article about constructing, and especially visualizing, the tree of life, called “Crunching the Data for the Tree of Life”. It’s interesting, especially since I think it touches on many issues concerning tree size that even phylogenetic biologists haven’t really considered. There is lots of talk of “big” trees, sometimes of only a few thousand OTUs, and of a new tree of plants containing 13,533 species[1]. Carl Zimmer over on the Loom writes that this is the biggest tree he knows of. It might be the biggest published tree I know of too, but Morgan Price on the FastTree site has a 16S rDNA tree to download containing “186,743 distinct sequences”. It’s 48MB when compressed. It will be interesting to hear of strategies to visualize a tree of this size while still maintaining associated information. The temptation, I’m sure, will be to make it pretty but ultimately not very useful. ARB can display trees this size (I think), although I still haven’t got to grips with automated collapsing and labelling of groups.
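As a rough illustration of what automated collapsing and labelling of groups could mean, here is a minimal Python sketch. It is not how ARB does it. The nested-list tree format, the `group_of` mapping and the function name `collapse_groups` are all assumptions made up for this example. The idea is simply that any clade whose tips all carry the same group name is replaced by a single labelled node.

```python
def collapse_groups(node, group_of):
    """Collapse every clade whose tips all belong to one named group.

    node     -- a tip name (str) or a list of child nodes (hypothetical format)
    group_of -- dict mapping tip name -> group name (tips may be missing)
    Returns (collapsed_node, shared_group_or_None, number_of_tips).
    """
    if isinstance(node, str):  # a tip
        return node, group_of.get(node), 1

    results = [collapse_groups(child, group_of) for child in node]
    groups = {group for _, group, _ in results}
    n_tips = sum(n for _, _, n in results)

    # All tips below this node share one known group: replace the clade
    # with a single labelled node.
    if len(groups) == 1 and None not in groups:
        group = next(iter(groups))
        return f"{group} ({n_tips} tips)", group, n_tips

    # Mixed or unassigned: keep the node, with any pure subclades collapsed.
    return [sub for sub, _, _ in results], None, n_tips


tree = [[["a", "b"], "c"], ["d", ["e", "x"]]]
groups = {"a": "Nematoda", "b": "Nematoda", "c": "Nematoda",
          "d": "Arthropoda", "e": "Arthropoda"}  # "x" is unassigned
collapsed, _, total = collapse_groups(tree, groups)
print(collapsed)  # ['Nematoda (3 tips)', ['d', ['e', 'x']]]
print(total)      # 6
```

On a tree with hundreds of thousands of tips the recursion would need replacing with an explicit stack, but the principle stays the same: the group names do the work, so the tree is only as readable as its annotation.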

The Smith paper looks really interesting, but I’ve only had a chance to skim it so far.

[1] Stephen A Smith, Jeremy M Beaulieu and Michael J Donoghue
Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches
BMC Evolutionary Biology 2009, 9:37 doi:10.1186/1471-2148-9-37

Apr 16 2008

There is a new release (94) of the SILVA database of ribosomal DNA sequences. It contains 23,133 Metazoans, 88,997 Eukaryotes and 606,879 SSU sequences in total.
I’m having a few problems installing ARB on a new machine, but I need to start exploring this phenomenal phylogenetic resource more closely. One thing I would love when browsing large trees purely for fun is to see pictures at the tips. I wonder if you could automatically grab taxon images, perhaps from Yahoo, maybe in a similar way to Rod Page’s iSpecies? There would be a few problems with this, I guess, but it ought to be quite achievable. It would certainly make the trees much more fun for non-taxonomists. The “animal tree of life” paper in Nature that I mentioned in a previous post provokes many more positive comments from my colleagues when they see the tree than I have ever seen with other phylogenies. I suspect it is because the authors include nice drawings at the tips. This simple thing might turn out to be very important for accessibility and dissemination of systematic info.

Feb 26 2008

ARB is a database program for sequence data, alignments and trees. It is primarily used by the microbial rDNA community, although it is equally powerful for other genes and taxonomic groups. ARB is my primary productivity software for phylogenetics and I thought I would introduce it briefly.

“The ARB software is a graphically oriented package comprising various tools for sequence database handling and data analysis. A central database of processed (aligned) sequences and any type of additional data linked to the respective sequence entries is structured according to phylogeny or other user defined criteria.” [http://www.arb-home.de/]
Although it has some irritations, and took me a little effort to install and learn, ARB is the most powerful phylogenetic environment currently available. Yes, there are some great phylogenetic inference programs (I like RAxML and PhyML), but that isn’t the same thing at all. This is an environment for understanding sequence data and associated information in a phylogenetic context, not just for inferring a good tree. My workflow runs something like this:
  • Search GenBank for sequences from taxonomic group of interest.
  • Import entire GenBank records into ARB (add my own sequences)
  • In ARB, align using Clustal and build quick NJ tree
  • Use ARB to add group names (e.g. “Nematoda”) from GenBank/EBI taxonomy or use my own group names (e.g. “very small worms”)
  • Check the alignment of sequences in “weird looking” phylogenetic groups or with odd branch lengths, and edit where necessary
  • Export the alignment (e.g. as a PHYLIP file) and build a good ML tree in RAxML
  • Re-import ML tree to ARB and transfer group names from previous annotated tree
  • Ponder
I find it incredibly useful to have the full GenBank record available by clicking on the tips of the tree, and to have user-defined tip names (such as “Genus species isolate_source accession_number”) that draw on info from the full record. ARB deals very well with tens of thousands of sequences. The alignment editor groups together sequences that have been grouped in the tree. Because each group has an editable consensus sequence whose edits propagate to all the sequences it contains, it is feasible to quickly check and edit thousands of sequences.
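To make the idea of user-defined tip names concrete, here is a minimal Python sketch of composing a label in the “Genus species isolate_source accession_number” pattern from fields of a record. This is not ARB’s own mechanism. The dictionary keys, the function name `tip_label` and the example values (including the placeholder accession) are assumptions for illustration only.

```python
def tip_label(record, fields=("genus", "species", "isolate_source", "accession")):
    """Build a tip name from selected fields of a sequence record.

    record -- dict of field name -> value (a hypothetical stand-in for a
              parsed GenBank record)
    fields -- which fields to use, in display order
    Missing or empty fields are skipped; spaces inside a value become
    underscores so each field stays a single token.
    """
    parts = [str(record[f]).strip().replace(" ", "_")
             for f in fields if record.get(f)]
    return " ".join(parts)


record = {
    "genus": "Caenorhabditis",
    "species": "elegans",
    "isolate_source": "soil sample",
    "accession": "ACC0001",  # placeholder, not a real accession number
}
print(tip_label(record))
# Caenorhabditis elegans soil_sample ACC0001
print(tip_label(record, fields=("species", "accession")))
# elegans ACC0001
```

Because the label is generated from the record rather than stored in the tree, changing the naming scheme is just a matter of passing a different `fields` tuple.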

Can you imagine having 10,000 sequences in an alignment editor? How long would it take you to check the alignment and make minor corrections? What about the tree of those 10,000 sequences? Are the names sensible? Can you include the accession numbers, or contract the genus name to a single letter without rebuilding the tree? How long does it take to scroll through your tree? What taxonomic info are you actually seeing when you scroll, is it just a blur of names, are you just relying on your memory of what species you are looking at or is there node labelling and group collapsing to help?
Although ARB is far from perfect, it is powerful and well designed. I don’t really see alternatives out there for dealing with lots of sequences and keeping all the data about those OTUs accessible. It’s what I’m going to be using to explore building (and understanding) trees with lots of tips.
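Outside ARB, the question of contracting the genus name without rebuilding the tree comes down to rewriting tip labels in the tree file. The following Python sketch does that for a Newick string. It is an assumption-laden illustration, not an ARB feature: it assumes unquoted labels with underscores instead of spaces and no whitespace in the string, and the function names and example labels (with placeholder accessions) are invented here.

```python
import re

# A tip label directly follows "(" or ","; internal node labels follow ")"
# and branch lengths follow ":", so neither is matched.
# Assumes unquoted labels and no whitespace in the Newick string.
_TIP = re.compile(r"(?<=[(,])([^(),:;]+)")


def relabel_tips(newick, rename):
    """Return the Newick string with rename() applied to every tip label."""
    return _TIP.sub(lambda m: rename(m.group(1)), newick)


def abbreviate_genus(label):
    """Turn 'Genus_species_...' into 'G._species_...'."""
    genus, sep, rest = label.partition("_")
    return f"{genus[0]}.{sep}{rest}" if sep else label


tree = ("((Caenorhabditis_elegans_ACC0001:0.1,"
        "Caenorhabditis_briggsae_ACC0002:0.2)Nematoda:0.3,"
        "Drosophila_melanogaster_ACC0003:0.5);")
print(relabel_tips(tree, abbreviate_genus))
# ((C._elegans_ACC0001:0.1,C._briggsae_ACC0002:0.2)Nematoda:0.3,D._melanogaster_ACC0003:0.5);
```

The topology, branch lengths and the internal “Nematoda” label are untouched, so nothing has to be re-inferred. A proper Newick parser would be the safer choice for files with quoted labels or comments.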

I’m actually using a very old version at the moment. A new version was released in December 2007, but I’m waiting for my new machine to arrive before I install it. I’m looking forward to checking it out.

Ludwig et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Research. 32(4):1363-1371. doi:10.1093/nar/gkh293