Feb 11 2009

The New York Times has an article about constructing, and especially visualizing, the tree of life called “Crunching the Data for the Tree of Life”. It’s interesting, especially since I think it touches on many issues concerning tree size that even phylogenetic biologists haven’t really considered. There is a lot of talk of “big” trees, sometimes of only a few thousand OTUs, and of a new tree of plants containing 13,533 species [1]. Carl Zimmer over on the Loom writes that this is the biggest tree he knows of. It might be the biggest published tree I know of too, but Morgan Price on the FastTree site has a 16S rDNA tree to download containing “186,743 distinct sequences”. It’s 48 MB when compressed. It will be interesting to hear of strategies to visualize a tree of this size while still maintaining the associated information. The temptation, I’m sure, will be to make it pretty but ultimately not very useful. ARB can display trees this size (I think), although I still haven’t got to grips with automated collapsing and labelling of groups.

The Smith paper looks really interesting, but I’ve only had a chance to skim it so far.

[1] Stephen A Smith, Jeremy M Beaulieu and Michael J Donoghue
Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches
BMC Evolutionary Biology 2009, 9:37. doi:10.1186/1471-2148-9-37

Jul 31 2008

I came across a nice program by Heroen Verbruggen called TreeGradients.

“TreeGradients is a tree drawing program. The tree drawing options are fairly basic but the program has the ability to plot several types of continuous variables at the nodes in colors and use linear color gradients to fill the branches between nodes. The output format is SVG (scalable vector graphics), which can be imported in most vectorial drawing software.”

It looks like Heroen is particularly interested in plotting continuous variables across trees. The part that immediately interested me was the ability to colour internal nodes by bootstrap (or Bayesian) support. In the example on the website, support is shown on a gradient running from pale grey (poor support) to black (strong support). When dealing with very large trees this is a nice visual trick to focus the mind on areas that are well supported and away from poorly supported areas (by making these less visible). Colours and the presence of numeric bootstrap values can be adjusted to taste. The program is actually a pair of Perl scripts distributed under an open-source GNU General Public Licence. I want to congratulate the author for making these open-source.
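The grey-scale trick is easy to sketch. The snippet below is not taken from TreeGradients (which is written in Perl); it is a small Python illustration of the idea, and the function names, the 0 to 100 support scale and the palest grey level of 220 are all my own assumptions.

```python
def support_to_grey(support, low=0.0, high=100.0, palest=220):
    """Map a support value to a hex grey: pale for weak support, black for strong."""
    # Clamp first so out-of-range values still give a valid colour
    fraction = (min(max(support, low), high) - low) / (high - low)
    # fraction 0 -> palest grey, fraction 1 -> black (level 0)
    level = round(palest * (1.0 - fraction))
    return "#{0:02x}{0:02x}{0:02x}".format(level)


def branch_gradient_svg(gradient_id, parent_support, child_support):
    """Return an SVG <linearGradient> that fades from the parent node's colour to the child's."""
    return (
        '<linearGradient id="{}">'
        '<stop offset="0%" stop-color="{}"/>'
        '<stop offset="100%" stop-color="{}"/>'
        '</linearGradient>'
    ).format(gradient_id,
             support_to_grey(parent_support),
             support_to_grey(child_support))
```

A branch drawn with `stroke="url(#b1)"` would then fade between the colours of the two nodes it joins, so a weakly supported clade recedes into the background.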

I haven’t actually tried it out yet but thought I’d flag it up now rather than my usual habit of waiting and waiting until I could review it properly (and my backlog is running at about 6 months now).

Mar 13 2008

Following on from my previous post, I decided to try Google Maps as an interface to large phylogenetic trees. This was a very quick and dirty go at seeing whether it would work as a navigable interface. I tried the implementation at MapLib, which allows you to upload your own images and use Google Maps to explore them. So I uploaded some PNG images generated from big ARB trees. It worked quite well. Unfortunately there are some restrictions on the size of image that can be uploaded to this site, so a thorough test of zooming about huge trees wasn’t really possible. But the image here is a screenshot of a smallish tree.

So, I realize that this doesn’t meet many of my own suggestions for getting information from large trees but it does have some interesting possibilities as a simple browser with a good user interface.
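To get a feel for why image size matters, here is a rough Python sketch of the tile pyramid that a Google Maps style viewer needs for one image. It assumes the usual 256 pixel square tiles with the image halved in size at each step out; the function name and the example dimensions are hypothetical, and MapLib's actual limits are not modelled.

```python
import math

TILE = 256  # tile edge in pixels, assumed to match Google Maps style viewers


def tile_pyramid(width, height, tile=TILE):
    """Return (zoom, scaled_width, scaled_height, columns, rows) per level, zoomed-out first."""
    # Deepest zoom is the first level at which the longest side fits at full resolution
    max_zoom = max(0, math.ceil(math.log2(max(width, height) / tile)))
    levels = []
    for zoom in range(max_zoom + 1):
        scale = 2 ** (max_zoom - zoom)      # image is halved for each step out
        w = math.ceil(width / scale)
        h = math.ceil(height / scale)
        levels.append((zoom, w, h, math.ceil(w / tile), math.ceil(h / tile)))
    return levels


if __name__ == "__main__":
    # A hypothetical tall tree image: 2,000 px wide, 60,000 px high
    for zoom, w, h, cols, rows in tile_pyramid(2000, 60000):
        print(zoom, w, h, cols * rows, "tiles")
```

Tree images are very tall and narrow, so most of the square "world" at each zoom level is empty, which is one reason a purpose-built viewer may still be the better long-term option.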

Mar 12 2008

It seems that Genome Projector has swept the blogosphere over the last 24 hours. I’ve seen it listed on many of the blogs I’m reading. It looks very good and intuitive. I just wanted to mention a couple of things.
First, this is a beautiful example of what happens when open source software is championed. Google Maps has essentially been reworked as a genome browser. Beautiful.

“Development API is available for Google Map View! Any image (in almost any format including GIF, PNG, JPEG, BMP and even SVG) of any size can be readily converted to zoomable image using generateGMap() API distributed within the G-language Genome Analysis Environment with open-source GNU General Public License.”

Second, is there any future in the Google Maps API for tree viewing? BioPerl will quickly convert any Newick file to SVG. The Google Maps interface is clearly mature code. It includes a little inset to show where you are in the genome (or tree), and it works in most browsers. Zoom and move are the basis of most tree navigation, and searching and marking locations (taxa) are already available. It might be worth a look.
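The Newick-to-SVG step that would feed such a viewer is not complicated. The sketch below is not BioPerl; it is a deliberately minimal Python illustration that reads tip names only (branch lengths and internal labels are discarded) and draws an unscaled cladogram. The function names, spacing and fixed 600 pixel width are my own assumptions.

```python
import re
from xml.sax.saxutils import escape

PUNCT = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse a Newick string into nested lists of tip names (lengths and labels dropped)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                          # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                          # consume ")"
            if pos < len(tokens) and tokens[pos] not in PUNCT:
                pos += 1                      # skip internal label / branch length
            return children
        label = tokens[pos]
        pos += 1
        return label.split(":")[0].strip()    # keep the name, drop ":length"

    return node()


def newick_to_svg(text, dx=40, dy=20):
    """Draw an unscaled cladogram: one row per tip, one column per level of nesting."""
    shapes = []
    rows = 0

    def place(node, depth):
        """Draw the subtree under `node` and return the y coordinate of its root."""
        nonlocal rows
        x = depth * dx + 10
        if isinstance(node, str):             # a tip takes the next free row
            rows += 1
            y = rows * dy
            shapes.append('<text x="{}" y="{}" font-size="12">{}</text>'
                          .format(x + 4, y + 4, escape(node)))
            return y
        child_ys = [place(child, depth + 1) for child in node]
        y = sum(child_ys) / len(child_ys)     # parent sits midway between its children
        for cy in child_ys:
            # elbow connector: along the parent's column, then across to the child
            shapes.append('<polyline fill="none" stroke="black" '
                          'points="{0},{1} {0},{2} {3},{2}"/>'.format(x, y, cy, x + dx))
        return y

    place(parse_newick(text), 0)
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="600" height="{}">{}</svg>'
            .format((rows + 1) * dy, "".join(shapes)))
```

Calling `newick_to_svg("((A,B),C);")` returns a string that can be written to a `.svg` file and then rasterised or tiled for a map-style viewer.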

Mar 01 2008

In order to really get information out of building phylogenetic trees (especially large ones) some thought has to be given to how to annotate the tips (OTUs).
The two programs that seem to do this in a powerful way are ARB and Treedyn. I also want to explore TreeQ-VISTA, which looks promising, but I haven’t really had a chance yet. (Has anybody got experience with TreeQ-VISTA?)

Treedyn is a very good program for editing and annotating phylogenetic trees. Its action can be driven by scripts and it can carry out many sophisticated graphical transformations.

“Many powerful tree editors are now available, but existing tree visualisation tools make little use of meta-information related to the entities under study such as taxonomic descriptions, geographic distribution or gene functions. This meta-information is useful for the analyses of trees and their publications, but can hardly be encoded within the tree itself (the so-called newick format). Consequently, a tedious manual analysis and post-processing of the tree’s images is required. Particularly with large trees, multiple trees and multiple meta-information variables. TreeDyn links unique leaf labels to lists of variables/values pairs of annotations (meta-information), independently of the tree topologies, remaining fully compatible with the basic newick format.” [www.treedyn.org]

What information can the tips be labelled with? The best thing would be to parse the information out of the original GenBank files of the sequences that were used to build the tree. Treedyn allows conditional annotation of OTUs by adding to or replacing the existing names. This can be done from an annotation file where the information is held as “key{value}” pairs, such as accession_number{AY123456}, on a line following the unique name from the Newick file.

I wrote a little Perl script to do this. It could be done much better using BioPerl, and my Perl skills are very basic, but it works.

#!/usr/bin/perl
use strict;
use warnings;

# Creates an annotation file for Treedyn from a file containing multiple
# GenBank records. Annotations are of the form key {value}. Keys must not
# contain spaces.

# usage: genbank2treedyn.pl infile.gb > outfile.tlf

$/ = "//\n";    # break up records on the GenBank // terminator line
while (<>) {
    # Skip trailing whitespace or anything else without an ACCESSION line
    my ($accession) = /ACCESSION\s+(\S+)/ or next;
    my ($author) = /AUTHORS\s+(\w+),/;                     # first author surname
    my ($genus, $species) = /organism="\s*(\S+)\s+(\w+)/;  # genus, species
    my ($isolate) = /isolate[ =]*"?([^"\s]+)/;             # isolate qualifier

    # Fields missing from a record are written as empty values
    # rather than carrying over a match from an earlier pattern
    $_ //= '' for $author, $genus, $species, $isolate;

    print "$accession \tgenus {$genus} \tspecies {$species} \taccession {$accession} \tisolate {$isolate} \tauthor {$author}\n";
}

In addition to tip names, Treedyn is able to annotate OTUs with graphical character data; there are some nice examples on the website.

Of course I also have some grumbles about Treedyn. It doesn’t work properly on Macs, never has. The PC version though seems very stable. The interface is an absolute nightmare. One of the most illogical and confusing I have ever seen. But you can learn to survive it with a little patience. Despite all this the actual functions are well thought out and powerful, even if applying them is difficult sometimes.

The best thing about Treedyn in my opinion is that it is open source.

Feb 29 2008

I came across SupraMap today. This is a way to overlay phylogenetic trees onto Google Earth images and examine the geographic distribution of the OTUs. Although there have been several postings on this before at iPhylo and CIPRES, and there is an implementation for Mesquite, the work at SupraMap looks quite polished. There is a video, and detailed instructions on how to build your own. I haven’t tried it yet, but it looks reasonably straightforward.

I like that they have clearly thought about incorporating data into the trees (different hosts in different colours) and what looks like a clickable interface to get more information. It also looks like an expanding project; I came across them when they were advertising for new programmers on EvolDir.

Feb 04 2008

What abilities should a phylogenetic visualisation tool have? What is important when you have so many tips (OTUs) that the tree is too big to print out or even scroll through on the screen? I have several pieces of research in this last category. In no particular order, here are some things that seem important to me:

  1. It should still be “snappy” when dealing with tens of thousands of OTUs. I think it should be standalone, not web-based, for tasks like this.
  2. It should be open-source with an active development community. Can we really keep relying on single program authors for development? No.
  3. It must interact with an associated data file. This data file can be common to a number of trees. It could be parsed from GenBank and keep ALL field data plus user data. This data file is essential for data-driven OTU renaming, searching, collapsing and exporting.
  4. It should collapse OTUs to groups from an associated data file and name these groups, i.e. automatically group OTUs into “mammalia”, “rotifera”, “arthropoda”, “diptera”. Collapse and name options could be parsed from GenBank taxonomy. See GRUNT.
  5. It should be able to collapse nodes automatically to form polytomies. These could be clades below a given support value, or below a certain branch length.
  6. It should be able to reroot, either by the user clicking on an OTU or clade, or by midpoint rooting (the default).
  7. It should be able to test for monophyly of groups. It could colour these groups accordingly. So if all descendant taxa of a node are called mammalia in the taxonomy file then the group is labelled “mammalia”. If another mammal is found outside of the mammalia clade then the group is flagged as non-monophyletic.
  8. It should be possible to see both the details and the whole picture. At the least, click to zoom in and out. An inset showing where in the tree one is, with a clickable interface to go somewhere else, is vital. See Rod Page’s ideas on visualisation of large trees on a web page.
  9. It needs to have search facilities. These should be able to search the tree and associated data files, with Boolean queries: find this text string in these fields AND this in that.
  10. User-definable tip names. It should be easy to switch between different tip names (taken from the data file), such as accession number, species name, etc. It should be possible to apply rules to this: if this and that, then name the tip like this.
  11. It must be able to export reliably, in all tree formats, with appropriately considered tip names etc., and as graphics, with SVG, PDF, EMF etc. supported. Exported graphics must be available in collapsed format too.
  12. It should be scriptable. It’s very useful to be able to incorporate it into a bioinformatics pipeline. So “program open treefile, collapse according to this datafile and criteria, rename tips according to this, export as SVG”.
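The monophyly test in point 7 can be sketched in a few lines. This is only an illustration in Python, assuming a rooted tree held as nested lists of tip names and a plain dictionary standing in for the taxonomy file; the function names and example taxa are hypothetical, and a real tool would need something faster than recomputing the tip set at every node.

```python
def tips(node):
    """Return all tip names under a node; a tip is a plain string."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tips(child)]


def is_monophyletic(tree, taxonomy, group):
    """True if some node's tips are exactly the members of `group` found in the tree."""
    members = {t for t in tips(tree) if taxonomy.get(t) == group}
    if not members:
        return False

    def search(node):
        found = set(tips(node))
        if found == members:
            return True
        # A tip that is not the whole group, or a clade missing some members,
        # cannot contain the node we are looking for
        if isinstance(node, str) or not members <= found:
            return False
        return any(search(child) for child in node)

    return search(tree)


if __name__ == "__main__":
    taxonomy = {"human": "mammalia", "mouse": "mammalia", "chicken": "aves",
                "fly": "arthropoda", "beetle": "arthropoda"}
    good = [[["human", "mouse"], "chicken"], ["fly", "beetle"]]
    bad = [[["human", "fly"], "mouse"], "beetle"]
    print(is_monophyletic(good, taxonomy, "mammalia"))
    print(is_monophyletic(bad, taxonomy, "mammalia"))
```

Because the tree is treated as rooted, the answer depends on where the root is placed, which is one reason rerooting (point 6) belongs in the same tool.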

Am I asking a lot? Not really; all this can be implemented with current code, it’s just that in general people don’t. Any suggestions for more? Any stuff you don’t agree with?

Many programs claim to deal with hundreds or thousands of tips on a tree. My cichlid mtDNA tree has approx 4000 OTUs. The NJ tree would, if printed out, fill more than 40 pages. There are several programs that can deal with this and feel reasonably fast, but it is almost impossible to get a meaningful look at the phylogenetic relationships. There is too much data on the screen; I can’t see the wood for the trees. It is essential to be able to collapse down the hundreds of almost identical mtDNA sequences coming from Lake Victoria fish and just label the resulting triangle “Victoria Superflock”. Immediately I can start to see their relationship to others without an enormous amount of scrolling. The datafile would allow me to have this done across the tree with taxonomic names. Imagine a big tree of birds presorted into orders, and labelled accordingly! Immediately you would be able to see what’s going on and begin the actual biological interpretation of your data.
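As a rough illustration of that kind of data-driven collapsing, here is a Python sketch that replaces every clade whose tips all carry the same group label with a single labelled tip. The tree is assumed to be nested lists of tip names and the data file is reduced to a dictionary; the function name, the tip names and the group labels in the test data are all made up.

```python
def collapse_by_group(tree, groups):
    """Replace each clade whose tips all share one group with a 'group (n tips)' label."""

    def summarise(subtree, group, count):
        # A uniform clade of several tips becomes a single labelled "triangle"
        if group is not None and count > 1:
            return "{} ({} tips)".format(group, count)
        return subtree

    def walk(node):
        """Return (subtree, shared group or None, number of tips)."""
        if isinstance(node, str):
            return node, groups.get(node), 1
        results = [walk(child) for child in node]
        count = sum(c for _, _, c in results)
        shared = {g for _, g, _ in results}
        if len(shared) == 1 and None not in shared:
            # Still uniform: leave it intact so the largest possible clade is collapsed
            return [sub for sub, _, _ in results], shared.pop(), count
        # Mixed clade: collapse each uniform child, keep the rest as they are
        return [summarise(*r) for r in results], None, count

    return summarise(*walk(tree))
```

With four Lake Victoria tips in one clade next to a Malawi tip and two Tanganyika tips, the result reads `["Victoria Superflock (4 tips)", ["m1", "Tanganyika (2 tips)"]]`, which is already far easier to take in than the full list of tips.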

There are 2 or 3 programs I am aware of that (almost) do all the above. In other posts I will discuss them, and how I’m currently using them for large scale phylogenetics and informatics. My favourites at the moment are ARB and Treedyn. There is a list of tree viewers at the Treedyn site that seems quite good, perhaps getting a little old now though.

I’ll describe my thoughts on current software, pros and cons, and “the future” in an upcoming posting.