Mar 08 2008

There’s an interesting paper in Nature on animal phylogeny. The authors take an EST approach, adding multi-gene data for 11 new animal phyla and increasing the data for others.

Dunn et al. “Broad phylogenomic sampling improves resolution of the animal tree of life” Nature. Published online 5 March 2008 doi:10.1038/nature06614

Abstract: Long-held ideas regarding the evolutionary relationships among animals have recently been upended by sometimes controversial hypotheses based largely on insights from molecular data. These new hypotheses include a clade of moulting animals (Ecdysozoa) and the close relationship of the lophophorates to molluscs and annelids (Lophotrochozoa). Many relationships remain disputed, including those that are required to polarize key features of character evolution, and support for deep nodes is often low. Phylogenomic approaches, which use data from many genes, have shown promise for resolving deep animal relationships, but are hindered by a lack of data from many important groups. Here we report a total of 39.9 Mb of expressed sequence tags from 29 animals belonging to 21 phyla, including 11 phyla previously lacking genomic or expressed-sequence-tag data. Analysed in combination with existing sequences, our data reinforce several previously identified clades that split deeply in the animal tree (including Protostomia, Ecdysozoa and Lophotrochozoa), unambiguously resolve multiple long-standing issues for which there was strong conflicting support in earlier studies with less data (such as velvet worms rather than tardigrades as the sister group of arthropods), and provide molecular support for the monophyly of molluscs, a group long recognized by morphologists. In addition, we find strong support for several new hypotheses. These include a clade that unites annelids (including sipunculans and echiurans) with nemerteans, phoronids and brachiopods, molluscs as sister to that assemblage, and the placement of ctenophores as the earliest diverging extant multicellular animals. A single origin of spiral cleavage (with subsequent losses) is inferred from well-supported nodes. Many relationships between a stable subset of taxa find strong support, and a diminishing number of lineages remain recalcitrant to placement on the tree.

They show that more data increases support for some nodes and resolves ambiguities at others. More data is always a good thing, I guess, unless you are being tortured by systematic biases. I notice that they don’t include either Caenorhabditis species. There have been discussions before that rhabditid sequences may be atypical of most nematodes and lead to long-branch artifacts. Most of this involves rDNA sequences, which seem to evolve quickly in C. elegans and relatives. The other influence of C. elegans on systematics is the still ongoing debate about the validity of Ecdysozoa. Papers seem to me to fall into two broad camps: (1) those that take a small number of taxa for a lot of genes (whole genomes), which usually reject Ecdysozoa; and (2) those that take more taxa, necessarily with fewer loci, which tend to support Ecdysozoa. It seems that class 2 can incorporate more than a hundred loci and still support Ecdysozoa, suggesting that a lack of signal is not to blame (Philippe et al 2005). This seems to indicate the presence of systematic biases in class 1 datasets, and the finger of blame points to C. elegans. It has been suggested that there are fast- and slow-evolving nematodes, and that the slow ones are much better to include in phylogenies of animals (Aguinaldo et al 1997).

The authors suggest that for some taxa whose positions are unstable in their tree (Rotifera, Bryozoa, Gnathostomulida) “improved taxon sampling may be the most promising strategy for resolving their positions”. I’ve worked on 2 of these 3 phyla, and most people have never even heard of them. Perhaps much more attention needs to be paid to less familiar animal groups. Animals selected for sequencing partly because of their small genomes (C. elegans, D. melanogaster?) are often not suitable for describing genome structure across all animals (e.g. Raible et al 2005).

Aguinaldo et al. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature (1997) vol. 387 (6632) pp. 489-93

Philippe et al. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol (2005) vol. 22 (5) pp. 1246-53

Raible et al. Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science (2005) vol. 310 (5752) pp. 1325-6

Feb 29 2008

I came across SupraMap today. This is a way to overlay phylogenetic trees onto Google Earth images and examine the geographic distribution of the OTUs. Although there have been several postings on this before at iPhylo and CIPRES, and there is an implementation for Mesquite, the work at SupraMap looks quite polished. There is a video, and detailed instructions on how to build your own. I haven’t tried it yet, but it looks reasonably straightforward.

I like that they have clearly thought about incorporating data into the trees (different hosts in different colours) and what looks like a clickable interface to get more information. It also looks like an expanding project; I came across them when they were advertising for new programmers on EvolDir.

Feb 26 2008

ARB is a database program for sequence data, alignments and trees. It is primarily used by the microbial rDNA community, although it is equally powerful for other genes and taxonomic groups. ARB is my primary productivity software for phylogenetics and I thought I would introduce it briefly.

The ARB software is a graphically oriented package comprising various tools for sequence database handling and data analysis. A central database of processed (aligned) sequences and any type of additional data linked to the respective sequence entries is structured according to phylogeny or other user defined criteria.
Although it has some irritations, and took me a little effort to install and learn, ARB is the most powerful phylogenetic environment currently available. Yes, there are some great phylogenetic inference programs (I like RAxML and PhyML), but that isn’t the same thing at all. This is an environment for understanding sequence data and associated information in a phylogenetic context, not just inferring a good tree. My workflow runs something like this:
  • Search GenBank for sequences from taxonomic group of interest.
  • Import entire GenBank records into ARB (add my own sequences)
  • In ARB, align using Clustal and build quick NJ tree
  • Use ARB to add group names (e.g. “Nematoda”) from GenBank/EBI taxonomy or use my own group names (e.g. “very small worms”)
  • Check the alignment of sequences in “weird looking” phylogenetic groups or with odd branch lengths, and edit where necessary
  • Export alignment as a PHYLIP file and build a good ML tree in RAxML
  • Re-import ML tree to ARB and transfer group names from previous annotated tree
  • Ponder
I find it incredibly useful to have the full GenBank record available by clicking on the tips of the tree, and user-defined tip names (such as “Genus species isolate_source accession_number”) drawing on info from the full record. ARB deals very well with tens of thousands of sequences. The alignment editor groups together sequences that have been grouped in the tree. Because these have an editable consensus sequence that propagates to all contained sequences it is feasible to quickly check and edit thousands of sequences.
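
As a minimal sketch of this kind of data-driven tip naming (this is not ARB's own mechanism; the record fields, the placeholder accession and the `tip_name` helper are all made up for illustration), labels can be rebuilt from a stored record at any time without touching the tree itself:

```python
# Hypothetical record: the field names and values are illustrative only,
# not the layout ARB or GenBank actually uses.
record = {
    "genus": "Caenorhabditis",
    "species": "elegans",
    "isolate_source": "soil",
    "accession": "ACC00001",  # placeholder, not a real accession number
}


def tip_name(rec, template="{genus} {species} {isolate_source} {accession}",
             abbreviate_genus=False):
    """Build a tip label from record fields using a user-defined template."""
    fields = dict(rec)  # copy, so the stored record is never modified
    if abbreviate_genus and fields.get("genus"):
        fields["genus"] = fields["genus"][0] + "."
    return template.format(**fields)


full_name = tip_name(record)
short_name = tip_name(record, "{genus} {species}", abbreviate_genus=True)
print(full_name)
print(short_name)
```

Because the label is derived from the record rather than stored in the tree, switching every tip from the full name to “C. elegans” style is just a change of template.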

Can you imagine having 10,000 sequences in an alignment editor? How long would it take you to check the alignment and make minor corrections? What about the tree of those 10,000 sequences? Are the names sensible? Can you include the accession numbers, or contract the genus name to a single letter without rebuilding the tree? How long does it take to scroll through your tree? What taxonomic info are you actually seeing when you scroll, is it just a blur of names, are you just relying on your memory of what species you are looking at or is there node labelling and group collapsing to help?
Although ARB is far from perfect, it is powerful and well designed. I don’t really see alternatives out there for dealing with lots of sequences and keeping all the data about those OTUs accessible. It’s what I’m going to be using to explore building (and understanding) trees with lots of tips.

I’m actually using a very old version at the moment. There was a new version released in December 07, but I’m waiting for my new machine to arrive before I install it. I’m looking forward to checking it out.

Ludwig et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Research. 32(4):1363-1371. doi:10.1093/nar/gkh293

Feb 21 2008

Following some links on other blogs I’ve recently seen an excellent article by T. Ryan Gregory called “Understanding Evolutionary Trees”. It introduces and explains evolutionary (phylogenetic) trees and highlights the importance of tree thinking. It even has a section on how NOT to read evolutionary trees, outlining common misunderstandings, misconceptions and misinterpretations. I wish I had had this available a few months ago when I was giving my introductory phylogenetics lectures. I think this will be compulsory reading for my students in next year’s course.

T. Ryan Gregory (2008) Understanding Evolutionary Trees. Evolution: Education and Outreach doi:10.1007/s12052-008-0035-x

Abstract: Charles Darwin sketched his first evolutionary tree in 1837, and trees have remained a central metaphor in evolutionary biology up to the present. Today, phylogenetics—the science of constructing and evaluating hypotheses about historical patterns of descent in the form of evolutionary trees—has become pervasive within and increasingly outside evolutionary biology. Fostering skills in “tree thinking” is therefore a critical component of biological education. Conversely, misconceptions about evolutionary trees can be very detrimental to one’s understanding of the patterns and processes that have occurred in the history of life. This paper provides a basic introduction to evolutionary trees, including some guidelines for how and how not to read them. Ten of the most common misconceptions about evolutionary trees and their implications for understanding evolution are addressed.

Feb 04 2008

What abilities should a phylogenetic visualisation tool have? What is important when you have so many tips (OTUs) that the tree is too big to print out or even scroll through on the screen? I have several pieces of research in this last category. In no particular order, here are some things that seem important to me:

  1. It should still be “snappy” when dealing with tens of thousands of OTUs. I think it should be standalone not web-based for tasks like this.
  2. It should be open-source with an active development community. Can we really keep relying on single program authors for development? No.
  3. It must interact with an associated data file. This data file can be common to a number of trees. It could be parsed from GenBank and keep ALL field data plus user data. This data file is essential for data-driven OTU renaming, searching, collapsing and exporting
  4. It should collapse OTUs to groups from an associated data file and name these groups, i.e. automatically group OTUs into “mammalia”, “rotifera”, “arthropoda”, “diptera”. Collapse and name options could be parsed from GenBank taxonomy. See GRUNT.
  5. It should be able to collapse nodes automatically to form polytomies. These could be clades below a given support value, or below a certain node length.
  6. It should be able to reroot, either by clicking on a user-chosen OTU or clade or by midpoint rooting (the default).
  7. It should be able to test for monophyly of groups. It could colour these groups accordingly. So if all descendant taxa of a node are called mammalia in the taxonomy file then the group is labeled “mammalia”. If another mammal is found outside of the mammalia clade then the group is flagged as non-monophyletic.
  8. It should be able to show both the details and the whole picture. At the least, click to zoom in and out. So maybe an inset showing where in the tree one is, with a clickable interface to go somewhere else, is vital. See Rod Page’s ideas on visualisation of large trees on a web page.
  9. It needs to have search facilities. These should be able to search tree and associated data files. Boolean. Find this text string in these fields AND this in that.
  10. User definable tip names. It should be easy to switch between different tip names (taken from the data file), such as accession number, species name, etc etc. Should be able to apply rules to this; if this and that then name tip like this.
  11. It must be able to export reliably, in all tree formats, with appropriately considered tip names etc. As graphics with SVG, PDF, EMF etc supported. Exported graphics must be available in collapsed format too.
  12. It should be scriptable. It’s very useful to be able to incorporate it into a bioinformatics pipeline. So “program open treefile, collapse according to this datafile and criteria, rename tips according to this, export as SVG”.
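
To make items 5 and 12 a little more concrete, here is a rough sketch of collapsing weakly supported nodes into polytomies in a scriptable way. It is not taken from any of the programs mentioned; the `Node` class, the `collapse_low_support` and `to_newick` helpers, the toy tree and the threshold of 70 are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None        # tip label (None for internal nodes)
    support: Optional[float] = None   # e.g. bootstrap %, None if unknown
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node: Node, threshold: float) -> Node:
    """Return a copy in which internal nodes with support below the
    threshold are dissolved, so their children attach directly to the
    parent and form a polytomy. The root itself is never dissolved."""
    new_children = []
    for child in node.children:
        collapsed = collapse_low_support(child, threshold)
        weak = (bool(collapsed.children)
                and collapsed.support is not None
                and collapsed.support < threshold)
        if weak:
            new_children.extend(collapsed.children)
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Write the topology only (no branch lengths or support values)."""
    if not node.children:
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"


# Toy tree ((A,B)95,(C,(D,E)99)40); the clade with support 40 is weak.
tree = Node(children=[
    Node(support=95, children=[Node("A"), Node("B")]),
    Node(support=40, children=[
        Node("C"),
        Node(support=99, children=[Node("D"), Node("E")]),
    ]),
])

collapsed_newick = to_newick(collapse_low_support(tree, 70)) + ";"
print(collapsed_newick)
```

The same function could just as well key on branch length instead of support; the point is only that the operation is a small, pure transformation that fits easily into a pipeline.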

Am I asking a lot? Not really; all this can be implemented with current code, people just generally don’t. Any suggestions for more? Any stuff you don’t agree with?

Many programs claim to deal with hundreds or thousands of tips on a tree. My cichlid mtDNA tree has approx 4000 OTUs. The NJ tree would, if printed out, fill more than 40 pages. There are several programs that can deal with this and feel reasonably fast, but it is almost impossible to get a meaningful look at the phylogenetic relationships. There is too much data on the screen; I can’t see the wood for the trees. It is essential to be able to collapse down the hundreds of almost identical mtDNA sequences coming from Lake Victoria fish and just label the resulting triangle “Victoria Superflock”. Immediately I can start to see their relationship to others without an enormous amount of scrolling. The data file would allow me to have this done across the tree with taxonomic names. Imagine a big tree of birds presorted into orders, and labeled accordingly! Immediately you would be able to see what’s going on and begin the actual biological interpretation of your data.
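
A sketch of what that collapsing could look like, under some loud assumptions: the tree is rooted and written as nested tuples of tip names, the tip-to-group table stands in for the data file, and the tip names, the “Other” group and both helper functions are invented for the example. Any clade whose tips all carry the same group label becomes a single labelled unit, and a group that ends up in more than one unit is flagged as non-monophyletic.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Collapsed:
    label: str    # group name shown on the collapsed "triangle"
    n_tips: int   # how many tips the triangle hides


def collapse_groups(tree, group_of):
    """Replace every clade whose tips share one group with a Collapsed unit."""
    def walk(node):
        # Returns (collapsed node, set of groups below it, number of tips).
        if isinstance(node, str):
            return node, {group_of[node]}, 1
        results = [walk(child) for child in node]
        groups = set().union(*(g for _, g, _ in results))
        n = sum(k for _, _, k in results)
        if len(groups) == 1:
            return Collapsed(next(iter(groups)), n), groups, n
        return tuple(r[0] for r in results), groups, n
    return walk(tree)[0]


def non_monophyletic(collapsed, group_of):
    """Groups that appear as more than one unit in the collapsed tree."""
    counts = Counter()

    def visit(node):
        if isinstance(node, Collapsed):
            counts[node.label] += 1
        elif isinstance(node, str):
            counts[group_of[node]] += 1
        else:
            for child in node:
                visit(child)

    visit(collapsed)
    return {g for g, c in counts.items() if c > 1}


# Hypothetical tips: three Victoria fish form a clade, a fourth sits elsewhere.
group_of = {
    "vic1": "Victoria Superflock", "vic2": "Victoria Superflock",
    "vic3": "Victoria Superflock", "vic4": "Victoria Superflock",
    "out1": "Other", "out2": "Other",
}
tree = ((("vic1", "vic2"), "vic3"), (("out1", "out2"), "vic4"))

collapsed = collapse_groups(tree, group_of)
flagged = non_monophyletic(collapsed, group_of)
print(collapsed)
print(flagged)
```

With thousands of tips the collapsed tree is what you would actually draw, and the flagged set tells you where to start looking.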

There are 2 or 3 programs I am aware of that (almost) do all the above. In other posts I will discuss them, and how I’m currently using them for large scale phylogenetics and informatics. My favourites at the moment are ARB and Treedyn. There is a list of tree viewers at the Treedyn site that seems quite good, perhaps getting a little old now though.

I’ll describe my thoughts on current software, pros and cons, and “the future” in an upcoming posting.

Jan 31 2008

Apologies to Jules Verne, but on 30.10.07 there were 20,363 Metazoan SSU rDNA accession numbers in the Silva 92 database. My immediate questions: can a reliable phylogeny be constructed from this by a biologist with average computing resources? Can the resulting tree be informative (or is it too big to get any meaningful information onto a single computer screen)? I think the answer to both is going to be “yes”.

[SILVA92; All domains of life (504,295); Eukarya (73,041)].

The (EMBL) taxonomic breakdown of these Metazoan sequences goes something like this…

48 Acanthocephala; 920 Annelida; 8833 Arthropoda; 27 Bilateria incertae sedis; 57 Brachiopoda; 24 Bryozoa; 30 Chaetognatha; 703 Chordata; 447 Cnidaria; 35 Ctenophora; 21 Cycliophora; 233 Echinodermata; 6 Echiura; 4 Entoprocta; 71 environmental samples; 52 Gastrotricha; 11 Hemichordata; 5 Kinorhyncha; 1 Loricifera; 5 Mesozoa; 2 Micrognathozoa; 1339 Mollusca; 386 Myxozoa; 45 Myzostomida; 4517 Nematoda; 18 Nematomorpha; 64 Nemertea; 7 Onychophora; 8 Placozoa; 1449 Platyhelminthes; 39 Pogonophora; 215 Porifera; 18 Priapulida; 78 Rotifera; 87 Sipuncula; 556 Tardigrada; 2 Xenoturbellida;

Note that the figure is made with a log scale! There are enormous differences in the sampling effort of different phyla with Arthropoda having almost 9000 sequences and Loricifera just one. These SSU rDNA sequences are aligned using a structural model, and could be an amazing resource for phylogenetics.
But there are of course some problems with a dataset of this size, and this is a topic (problems and how to solve or avoid them) I will probably keep returning to.
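
To make the sampling imbalance concrete, the breakdown listed above can be parsed and drawn as a crude text bar chart on a log scale. This is only a sketch; the bar scaling (ten characters per factor of ten) is an arbitrary choice of mine and has nothing to do with how the original figure was drawn.

```python
import math

# Taxonomic breakdown copied from the list above.
breakdown = (
    "48 Acanthocephala; 920 Annelida; 8833 Arthropoda; "
    "27 Bilateria incertae sedis; 57 Brachiopoda; 24 Bryozoa; "
    "30 Chaetognatha; 703 Chordata; 447 Cnidaria; 35 Ctenophora; "
    "21 Cycliophora; 233 Echinodermata; 6 Echiura; 4 Entoprocta; "
    "71 environmental samples; 52 Gastrotricha; 11 Hemichordata; "
    "5 Kinorhyncha; 1 Loricifera; 5 Mesozoa; 2 Micrognathozoa; "
    "1339 Mollusca; 386 Myxozoa; 45 Myzostomida; 4517 Nematoda; "
    "18 Nematomorpha; 64 Nemertea; 7 Onychophora; 8 Placozoa; "
    "1449 Platyhelminthes; 39 Pogonophora; 215 Porifera; 18 Priapulida; "
    "78 Rotifera; 87 Sipuncula; 556 Tardigrada; 2 Xenoturbellida;"
)

counts = {}
for entry in breakdown.split(";"):
    entry = entry.strip()
    if not entry:
        continue  # skip the empty piece after the trailing semicolon
    number, name = entry.split(" ", 1)
    counts[name] = int(number)

total = sum(counts.values())

# One '#' for a single sequence, plus ten more per factor of ten.
for name, n in sorted(counts.items(), key=lambda item: -item[1]):
    bar = "#" * (1 + round(10 * math.log10(n)))
    print(f"{name:<26}{n:>6}  {bar}")

print(f"{'Total':<26}{total:>6}")
```

The per-phylum counts sum to the 20,363 accessions quoted at the top of the post.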

Why do phylogenetics on 20k sequences?
“Why not just sub-sample a few sequences from each group and do a ‘normal’ tree?”
Well, other people have done this before. I think there are a number of issues with this, and there are several advantages to “wide and deep” taxonomic sampling in both method and concept. Details will have to wait for another time, but I think it is important and definitely the way people will prefer to go (if it is clear it can be done in a straightforward manner).
This exercise isn’t primarily to redo metazoan taxonomy using a single gene, it is to explore techniques, and determine what is important in taking this approach.

How can we do phylogenetics on 20k sequences?
My primary goal is to be able to take the Silva 20k SSU dataset for Metazoa, construct a decent (and well-annotated) tree, and browse relationships between taxa in an informative way.
I have been playing with tree building approaches on large datasets already, using both minimum evolution and maximum likelihood. More posts on this later. At the moment I have been concentrating on a subset (about 5000 sequences), until I get my thoughts together, and overcome some issues. In summary though, yes it is possible to build trees on this scale on a desktop computer.

What is a “big” tree?
There is no meaningful definition of ‘big’. In the literature when people write “large phylogenetic trees” they have in the past meant 100 taxa, in the recent past maybe 1000 taxa, people now (occasionally) talk about several thousand or more. Obviously the definition changes with time. I just mean a tree too big for the standard software and approaches to deal with comfortably. So that’s more than 1000 (although most software struggles in one way or another with more than about 200 taxa!).

Once you have a tree, how can it be visualised in order to actually learn something?
This is actually a real problem in phylogenetic biology. Viewing very large trees is the topic of an upcoming blog post.

I’ve been thinking about constructing, processing and visualising (or extracting info from) big trees quite a bit and have developed some strategies that are very useful to me, and might be to other people too. (I have been routinely dealing with phylogenies of 3000-5000 sequences, in alignments of about 1500 bp, on a fairly low spec computer.) This post is really to introduce the dataset, the SILVA site and my vague aims.