Feb 11 2009
 

The New York Times has an article about constructing, and especially visualizing, the tree of life, called “Crunching the Data for the Tree of Life”. It’s interesting, especially since I think it touches on many issues concerning tree size that even phylogenetic biologists haven’t really considered. There is lots of talk of “big” trees, sometimes of only a few thousand OTUs, and of a new tree of plants containing 13,533 species [1]. Carl Zimmer over on The Loom writes that this is the biggest tree he knows of. It might be the biggest published tree I know of too, but Morgan Price on the FastTree site has a 16S rDNA tree to download containing “186,743 distinct sequences”. It’s 48MB when compressed. It will be interesting to hear of strategies to visualize a tree of this size while still maintaining the associated information. The temptation, I’m sure, will be just to make it pretty but ultimately not very useful. ARB can display trees this size (I think), although I still haven’t got to grips with automated collapsing and labelling of groups.

The Smith paper looks really interesting, but I’ve only had a chance to skim it so far.

[1] Stephen A Smith, Jeremy M Beaulieu and Michael J Donoghue
Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches
BMC Evolutionary Biology 2009, 9:37. doi:10.1186/1471-2148-9-37

Oct 29 2008
 

I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.

I started with the annelids file. The file contains a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences from all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite, but these are bad choices for very large alignments. There are some online resources, but my fasta files are 50-250MB, so online is not the place, even if I could persuade a server to accept my files. I should really have used BioPerl SimpleAlign to remove gap columns; it’s probably the most flexible option and able to deal with big files, but I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.
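For what it’s worth, stripping all-gap columns can be sketched without BioPerl. The following is only a minimal Python sketch, not something I ran on the SILVA files: the function names are made up for illustration, and it assumes that both "-" and "." mark gaps. It reads the alignment twice, first to find the columns that contain at least one residue and then to write only those columns, so it holds just one sequence in memory at a time.

```python
import io

GAP_CHARS = set("-.")  # assumption: both '-' and '.' mark gaps in the export


def read_fasta(handle):
    """Yield (header, sequence) pairs, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def occupied_columns(handle):
    """Pass 1: flag every column that holds at least one non-gap character."""
    keep = bytearray()
    for _, seq in read_fasta(handle):
        if len(seq) > len(keep):
            keep.extend(bytes(len(seq) - len(keep)))  # grow with zeros
        for i, ch in enumerate(seq):
            if ch not in GAP_CHARS:
                keep[i] = 1
    return keep


def strip_gap_columns(in_handle, out_handle, keep):
    """Pass 2: rewrite each record using only the flagged columns."""
    for header, seq in read_fasta(in_handle):
        trimmed = "".join(ch for i, ch in enumerate(seq) if keep[i])
        out_handle.write(f">{header}\n{trimmed}\n")


# Tiny made-up alignment: columns 3 and 4 are gaps in every sequence.
demo = ">a\nAC--G-\n>b\nA---GT\n"
keep = occupied_columns(io.StringIO(demo))
result = io.StringIO()
strip_gap_columns(io.StringIO(demo), result, keep)
print(result.getvalue())
```

With a real file the two passes would each open the fasta file afresh instead of using io.StringIO. A character-by-character loop in pure Python will be slow on files of a few hundred MB, so treat this as a starting point only.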

I ran it through Gblocks instead, which does more than just remove blank columns: it also trims areas of poor alignment, judged by various criteria. This reduced the file considerably.

I had previously installed FastTree, so I ran it with the command

fasttree -nt annelids.fasta >annelids.tree

It ran quite nicely and produced a viable tree.
Something strange with the timings though.

Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END:    2008-10-28 18:23:32
----------------------------------------------
Runtime:     5886         seconds
Runtime:     01:38:06     h:m:s
----------------------------------------------

Everything from “END:” onwards is the output of my perl script; the lines before it come from fasttree. So fasttree claims to have taken 1577 seconds (26 minutes), but my script times it at 1 hour 38 minutes. I actually noted the time it started, and it did take 1 hour 38 minutes. I repeated the run with identical results. A strange discrepancy.
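To put numbers on the gap, here is a small Python sketch that converts the two figures from the log above into h:m:s and subtracts them. The function name format_hms is just an illustrative choice; the two numbers are the ones quoted above, with the fasttree figure rounded to whole seconds.

```python
def format_hms(total_seconds):
    """Format a whole number of seconds as hh:mm:ss."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


fasttree_reported = 1577  # seconds, from the "Time 1577.05" line
script_measured = 5886    # seconds, from the perl timer

print("fasttree reports:", format_hms(fasttree_reported))    # 00:26:17
print("script measures: ", format_hms(script_measured))      # 01:38:06
print("unaccounted for: ", format_hms(script_measured - fasttree_reported))  # 01:11:49
```

So about an hour and twelve minutes of the run is not covered by the figure fasttree prints.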

Oct 29 2008
 

The prediction I made before about a long silence once this year’s students turned up was sadly accurate. Anyway, students dealt with, grant proposal submitted, lectures (mostly) given, bureaucracy reduced (a bit), time to get on with some phylogenetics.

I was playing before with FastTree. Although it looks to have been quite well tested by its developers, it’s always worth investigating real-world experiences. Last time I put together a little perl script to time the runs, but then I noticed that FastTree actually reports the number of seconds taken. Hmmm, did I just miss that all along, or did it appear in version 1.0 without my noticing? Oh well, either way I must pay more attention.

Morgan Price (the developer of FastTree) released a version a few days ago that should compile without the malloc error I talked about before.

Sep 26 2008
 

In order to see how quickly FastTree runs for me I need some automated method of timing it. While some programs like phyML return a runtime at the end, FastTree doesn’t seem to. So I searched the web and found bits of perl code to put a script timer together.

I have uploaded the relevant script (posixtime.pl) to my repository website. It seems to work well for me but test it for yourself. Since it is based on POSIX I think it will only run on *nix systems (like Linux and MacOSX) although it can be made to work on Windows too perhaps (see here).

It has some placeholder code that prints out stuff and reports back how long it has taken. Everything between # Script goes here # and # Script ends here # can be deleted and replaced with the appropriate commands. So to run FastTree, include a line like this (note the quotes, which perl needs around the command):
system("FastTree -nt alignment_file > tree_file");

I first thought the system command was *nix specific, but system is a core Perl function and is also available on Windows builds such as ActivePerl. Sorry Microsoft guys, I’ve never run perl on a Windows machine, so I haven’t tested the script there.
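The same idea can be sketched in Python for anyone who would rather not use perl. This is a rough equivalent and not a translation of posixtime.pl: the function name time_command and the dictionary keys are my own inventions. It records the elapsed wall-clock time and also the CPU time used by the child process, which makes it easier to compare against whatever figure a program prints for itself. Note that os.times() reports the child CPU fields as zero on Windows.

```python
import os
import subprocess
import sys
import tempfile
import time


def time_command(args, stdout_path):
    """Run args (a list of strings), write its stdout to stdout_path,
    and return the exit status plus wall-clock and child CPU seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    with open(stdout_path, "w") as out:
        finished = subprocess.run(args, stdout=out)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    child_cpu = ((cpu_after.children_user - cpu_before.children_user)
                 + (cpu_after.children_system - cpu_before.children_system))
    return {
        "returncode": finished.returncode,
        "wall_seconds": wall_after - wall_before,
        "child_cpu_seconds": child_cpu,
    }


# Stand-in command so the sketch runs anywhere; swap in the real one below.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.txt")
    timing = time_command([sys.executable, "-c", "print('tree goes here')"],
                          out_path)
    with open(out_path) as handle:
        captured = handle.read()

print(timing)
```

For a FastTree run matching the perl line above, the call would be time_command(["FastTree", "-nt", "alignment_file"], "tree_file").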

I think the above command, without the ./ prefix, depends on FastTree being installed in a directory that the shell searches for programs. This is /usr/bin on my system.
Type the command: which perl
You should probably get: /usr/bin/perl
Move to the level above: cd /usr/bin
Copy the FastTree application to here: sudo cp path_to_application ./
Enter password when asked. If you don’t want to type out the path to the application you can (in OSX) just drag and drop the application into the terminal window after you have typed the sudo cp part, and it will paste in the location of the file you have dropped.

You should then be able to launch FastTree by giving the FastTree command wherever you are, without having to cd and move to the directory containing the application.
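If it doesn’t launch, a quick way to see what the shell can find is to list the directories on the PATH and ask where a program would be picked up from. A minimal Python sketch, assuming nothing beyond the standard library (find_program is a name I made up for the wrapper):

```python
import os
import shutil


def find_program(name):
    """Return the full path that would be run for `name`, or None."""
    return shutil.which(name)


# Directories searched for programs, in order.
for directory in os.environ.get("PATH", "").split(os.pathsep):
    print(directory)

# Prints a path such as /usr/bin/FastTree if the copy worked, otherwise None.
print(find_program("FastTree"))
```

If this prints None, the application is not in any of the listed directories, and it will only start when you give its full path or use the ./ prefix from its own directory.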


I have posted the FastTree application I described in the previous post at my file repository site, in case you don’t want to install developer tools and mess around with malloc errors.

Sep 26 2008
 

This is how I downloaded, compiled and got FastTree working. It’s a bit obvious in places, but I think detailed instructions are a good thing to have out there and Google findable. I am using a multicore MacPro 2.8GHz with 4GB RAM and OSX 10.5.4 (I’m not sure the 8 cores make any difference whatsoever if the code isn’t written to take account of them).

  • I downloaded FastTree from www.microbesonline.org/fasttree.
  • I had to use Safari to do this as Firefox wouldn’t let me right click and download.
  • FastTree 1.0.0 is available as binaries for Windows and Linux. Unfortunately there are no Mac binaries
  • I downloaded the C code file (156kb)
  • I installed the developer tools “XcodeTools.mpkg” from system software install DVD number 2. It took about 15 minutes. This allowed me to use the gcc compiler to actually make the application.
  • I opened terminal, moved to the location of the c file and issued the command: gcc -lm -O2 -Wall -o FastTree FastTree.c
  • It didn’t like that much and gave the error: FastTree.c:212:20: error: malloc.h: No such file or directory

After some Google searching I came across a couple of indications that malloc.h (I had no idea what this was) was outdated.

malloc.h not supported, use stdlib.h (http://developer.apple.com/technotes/tn2002/tn2071.html)

Most helpful was this:

Mac OS X for unix geeks
http://unix.compufutura.com/mac/ch05_01.htm
5.1.2. malloc.h

make may fail in compiling some types of Unix software if it cannot find malloc.h. Software designed for older Unix systems may expect to find this header file in /usr/include; however, malloc.h is not present in this directory. The set of malloc( ) function prototypes is actually found in stdlib.h. For portability, your programs should include stdlib.h instead of malloc.h. (This is the norm; systems that require you to use malloc.h are the rare exception these days.) GNU autoconf will detect systems that require malloc.h and define the HAVE_MALLOC_H macro. If you do not use GNU autoconf, you will need to detect this case on your own and set the macro accordingly. You can handle such cases with this code:

#include <stdlib.h>
#ifdef HAVE_MALLOC_H
#include <malloc.h>
#endif

  • So I opened the C file in a text editor and searched for malloc.h, which I found on line 212. I then deleted that line (#include <malloc.h>) and inserted the four lines from above.
  • I repeated the gcc command to build the application from above and it worked. No errors and it produced the application in 1 second.
  • I tried to launch it and display the help file using the terminal command: ./FastTree -h
  • It didn’t work at all, so for no logical reason I just dumped the application and rebuilt it with the same gcc command. This time ./FastTree -h did launch the application and it displayed the help file. Success!
  • Using the simple instructions from the FastTree page I tried to run it on some lizard DNA sequences I had lying around. These were fasta sequences, although it claims to work on phylip also. The command was: ./FastTree -nt lizards.fasta > lizards.tre
  • It produced a tree with only the first taxon name and nothing else. The input file had mac line endings though and when I corrected that to unix it was fine.
  • I noticed that some of the names were truncated and wondered if they had been chopped at spaces. I replaced with underscores and got better results. [In the spirit of full disclosure these last two points were with the previous version of FastTree and I didn't try to replicate these errors with version 1.0.0]
  • Another problem I had with the previous version was when I accidentally had a repeated taxon in the matrix. It complained about a “non unique name X in the alignment” and wrote an empty treefile.
  • Having got around these teething problems it ran perfectly, almost instantly writing a treefile (input fasta N=112, 923bp). The tree looked quite sensible and it had support values (local bootstraps). These are on a 0 to 1 scale, so 0.95 is 95%.
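The three input problems from the list above (old Mac line endings, spaces in names, repeated names) can all be checked before a run. Here is a minimal Python sketch of such a tidy-up; the function name tidy_fasta and the lizard names in the example are invented for illustration, and I haven’t tested how any particular FastTree version reacts to its output.

```python
from collections import Counter


def tidy_fasta(text):
    """Return (cleaned_text, duplicate_names) for FASTA-formatted text."""
    # Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    names, lines = [], []
    for line in text.split("\n"):
        if line.startswith(">"):
            # Join the words of the name with underscores so nothing
            # after the first space gets chopped off.
            name = "_".join(line[1:].split())
            names.append(name)
            line = ">" + name
        lines.append(line)
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    return "\n".join(lines), duplicates


# Made-up example with Mac line endings, spaces in names and a repeat.
raw = ">Anolis carolinensis\rACGT\r>Gekko gecko\rACGA\r>Gekko gecko\rACGA\r"
cleaned, duplicates = tidy_fasta(raw)
print(cleaned)
print("repeated names:", duplicates)
```

When reading a real file, open it with newline="" so that Python hands over the line endings exactly as they are on disk. The sketch only reports repeated names; deciding which copy to drop is left to you.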

Fairly straightforward (except for the malloc error) on the whole. Next I’m going to report on some bigger runs and start timing them properly. My immediate goal is to get a tree of all ~20k metazoan 18S rDNA sequences. Of course a tree of that size will bring its own problem: how to visualize it.

Sep 26 2008
 

So when I started writing this blog I thought I would use it to outline some of the things I was working on as I went along. Not real projects, which I will write up and publish, but side projects and how I got them to work (or otherwise). Unfortunately there hasn’t been much of that, lots of reviews and comment instead. Hopefully I am going to change that now.

Last April I posted about FastTree, the rapid NJ application that seems to be able to handle approx 40,000 OTUs. The authors say:
“FastTree inferred a phylogeny for an alignment of 39,092 proteins, including support values, in half an hour on a desktop PC”.

I actually compiled and started using it back in April, but got swamped by teaching and theses and stuff and never posted anything more. My next few posts are going to be about getting FastTree to work and how it copes with some of my datasets. Alternatively as the new crop of undergraduates arrive next week there may be a long silence instead.