Oct 17 2014

We are still largely missing the benefits of reproducibility in phylogenetics. I think that this makes our lives unnecessarily difficult and makes us particularly poorly prepared to confront modern data-rich phylogenetics. In this first post, “Why”, I want to talk about why we need reproducible phylogenetics. Then, in part two, “What”, I’m going to talk about some possible approaches to reproducible phylogenetics. In part three, “How”, I’m going to look at some existing software solutions. Lastly, in “ReproPhylo”, I’m going to write about the work my lab is doing to bring these approaches together to create new reproducible phylogenetics solutions.

tl;dr: Reproducibility helps us do better phylogenetics and do it more easily. There are a number of partial solutions out there. We introduce the ReproPhylo framework for easier and better reproducible experimental phylogenetics.

Why do we need reproducible phylogenetics?

Three short answers, then I explain below.

Answer 1: to do better quality science. This is achieved by being able to build on and extend other people’s work. It is also achieved by being able to take an experimental approach to phylogenetic methodologies.

Answer 2: to make your life much easier. The person most likely to reproduce your work is “future you”, so making it easy to reproduce, and then modify, your analysis will save you lots of time. It will help you to do more actual research and less reformatting of files and coaxing belligerent applications to read them.

Answer 3: if your work isn’t fully reproducible, is it really science? Sure, it’s nice work that clarifies some important issues, you’re a bright person and it’s likely correct, but if it’s not reproducible… what the hell are you thinking? Is this why you got into science? To accept stories from other scientists based on them being bright and it sounding right? You are much more sceptical than that about the science you read, so shouldn’t people also be sceptical of your work? Yeah, exactly.

Standing on the shoulders of giants

The ability to extend the work of others, to stand on their ‘shoulders’ [1] and reach higher, is how progress is made. “Wow, if I just added these species to that tree, used their analytical approach, I could actually test [whatever]”. But can you add species and use that approach? Or do you have to start from scratch, collecting sequences from GenBank and trying to reproduce their work before extending it?

How much do you want your work to have impact going forward? Make it easy for people to extend your work and you will be influential.

‘Experimental’ phylogenetics

This refers to approaches where we test the influence of method, parameter choice, and data inclusion on our tree structure. How many studies have you seen where people explore parameter space exhaustively and explicitly compare the phylogenies produced? Not many, I would guess. Any? The reason is that it is too difficult to experiment when manual approaches to phylogenetics are used. Have you ever experimented with alignment parameters in your MSA program of choice? Most phylogeneticists just run the defaults; you can check the Methods sections of papers to confirm this. If I have to align sequences with 6 different programs, each with 50 different combinations of parameters, and then compare some characteristics of the 300 alignments and resulting trees, this is a truly mammoth piece of work. If, however, I have an entirely reproducible pipeline that will iterate over parameter space and produce a detailed report with clear summaries of alignment characteristics and tree variability, then this becomes not an exceptional piece of work but just something I would typically do before getting down to detailed analysis of the question at hand. If a reviewer or critic thinks I have chosen the wrong range of parameters to optimise, they can simply add others and hit RUN on my pipeline to compare against my optimum values. The robustness of science improves.
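To make the scale of this concrete, the sweep itself is just a nested loop over parameter grids. Here is a minimal Python sketch; the programs, flags and values below are illustrative placeholders, not a recommended grid:

```python
# Sketch of an alignment parameter sweep. The program names and
# parameter grids are illustrative placeholders, not a benchmark setup.
from itertools import product

def build_commands(infile, programs):
    """Expand each program's parameter grid into concrete command lines."""
    commands = []
    for prog, grid in programs.items():
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            opts = " ".join(f"{k} {v}" for k, v in zip(keys, values))
            commands.append(f"{prog} {opts} {infile}")
    return commands

# Hypothetical grids: two programs with a few parameter values each.
programs = {
    "muscle": {"-maxiters": [2, 4, 8, 16], "-gapopen": [-12, -10, -8]},
    "mafft": {"--op": [1.0, 1.53, 2.0], "--ep": [0.0, 0.123]},
}
cmds = build_commands("cichlid_ND2.fasta", programs)
print(len(cmds))  # 4*3 + 3*2 = 18 runs from one loop
```

Swapping in real command templates and feeding each line to a job runner turns this from a thought experiment into the 300-alignment sweep described above.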

Who will actually reproduce my work?

It could be anyone. What if Professor Big loves your work and wants to extend it? Great! But the reality is that the person who will certainly need to reproduce and extend your work is Future You! Make life easy for yourself by starting out reproducibly; anyone else calling you a giant and wanting to stand on your shoulders is a bonus.

What makes you think phylogenetics isn’t pretty much OK now?

Long and painful experience makes me think that. Why don’t you try reproducing 3 phylogenetics results from papers, and then see whether your perspective has changed. Can you get their data easily, or is it a list of GenBank identifiers in a supplementary Word table that you then have to type into the NCBI website? Can you run their software? If it’s an old paper, can you even find their software? Do you know the dependencies needed to run it? Do you know what version they ran? Do you know the parameters they used? Are the default parameters now the same as then? Did they exactly record data transformations? Maybe they changed sequence names between the original files and the tree figure. Maybe some taxa were excluded as problematic. Maybe the alignment was manually improved. Maybe some alignment regions were masked. All that is fine, but do you know exactly what they did and how? Did they archive the final tree or only a picture of it? A picture only allows you to compare by visual inspection to see if you have reproduced a previous study. It is estimated that >60% of phylogenetics studies are ‘lost to science’ [2]. This is a problem.

What is Reproducibility?

I’m not going to cover the semantic differences between reproducibility, replication, repeatability etc. Here I take a practical view, using ‘reproducibility’ as a catch-all for the terms above. I really like this video of Carole Goble explaining the concepts of reproducible research.

Reproducibility is the correct way to do science.

Reproducibility is so integral to what we consider the scientific process that it is hard even to make a counter case here, so I won’t really try. So why isn’t reproducibility the norm? Well, a technically poor form of reproducibility is the norm: the Methods section of the journal article. Later in this series I suggest that technical challenges have prevented complete and efficient reproducibility in the past (it hasn’t been your fault), but now those challenges are pretty much solved (part three, “How”) and we should grasp and benefit from these new possibilities.

[1] Standing on the shoulders of giants. Wikipedia. Available: http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants.

[2] Magee AF, May MR, Moore BR: The Dawn of Open Access to Phylogenetic Data. arXiv [q-bio.PE] 2014. http://arxiv.org/abs/1405.6623

Jan 24 2014


I’m pretty proud of some parts of my workflow: electronic lab notebook, reproducibility, open data files, (semi-obsessive) automated data backups, etc. But pride often comes before a fall. I had a bad experience this week when I thought I had lost some important phylogenetic data files (I found them eventually), and I’m writing this to work through what I did wrong and what I need to change in my work routine.

About 4 years ago I built a phylogenetic tree of some cichlid fish. It was a small, relatively simple analysis, just to describe the phylogenetic relationship between some test fish species in a collaborative genomic experiment. The pretty picture of the phylogeny was a manuscript figure, the analysis had been written up in my ELN as I did it, data files were backed up; no need to worry. Time passed, the paper stalled and then went through a tediously long review, but now I need to submit the files to Dryad and TreeBASE ASAP. Fortunately I have the original sequence alignments, the notes on the analysis, and the tree file. Or do I?

This was a few years ago, and the way I do phylogenies has changed quite a bit since then (the subject of another post). My problem was that I had carried out three similar cichlid projects around the same time: three cichlid phylogenies using the same genes to address three different questions. I had lots of analysis iterations for each data set as I worked through parameters and approaches. I had made at least two errors:

Sloppy naming

I had been very unimaginative in my file and folder naming schemes. Lots of things called cich_phylo1 or Bayes_new or cichlid_ND2. Lots of almost identical files in nested folders to search through in order to find the one that had generated the figure. Someone I once worked with had the strategy of creating enormously long filenames detailing the analyses that had generated each file. It was impressive, but I’m not sure I could do it, though it might work well for generated rather than manually created files. Maybe I’ll adopt it for folder names?

Full file paths

But surely if I had a good experimental record it would identify the exact file I was dealing with? Yes, it does, though I shamefully often neglected to give the full file path. It shouldn’t have been a problem, but the analysis was actually done in a very unusual location on my HD (a folder shared with a Linux partition), and so fell outside my search. The reason I failed to give full file paths was that I was working fast, to a short deadline, and running more than one analysis simultaneously. I was pasting data details, ML parameters, results, and next moves into my lab book, but I was doing it all too fast. If I had an easy way to insert full paths I might have used it, but I didn’t. In OS X you can create shortcuts to copy the full file path, and this is now only a right click away, so I have no excuse for not saving full paths into my ELN now.
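A tiny helper makes this painless. The sketch below is my own invention (the function name and record format are not from any ELN package): it stamps an entry with the file’s absolute path plus a checksum, which would also have answered the “which of these near-identical files was it?” question:

```python
# A minimal sketch of stamping an ELN entry with the absolute path,
# timestamp, and checksum of a data file, so "which file was it?" is
# answerable years later. Function and record format are invented here.
import hashlib
import os
import time

def eln_stamp(path):
    """Return a one-line record identifying a file unambiguously."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()[:12]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp}  {os.path.abspath(path)}  sha256:{digest}"

# Example: stamp a freshly written alignment file.
with open("cich_phylo1.fasta", "w") as fh:
    fh.write(">taxon_A\nACGT\n")
print(eln_stamp("cich_phylo1.fasta"))
```

The checksum is independent of filenames and modification dates, so it survives sloppy naming and date-mangling backups alike.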

NB, file modification dates do not always persist

Also, although I had the creation date of the files from my ELN, I wasn’t confident that the dates still persisted. OS X changes modification dates sometimes for no apparent reason: open a file, read, close, and bam! Its folder now has today’s date. Also, some backups I did a few years ago used ftp (by accident), obliterating the file dates. Although the files in question did have the right dates, I had realised early on in my search that I couldn’t rely on that.

What would have helped?

  • Full file paths
  • Unique, descriptive names for folders, if not for files as well
  • Annotation of ELN blog posts with a “submitted file” tag, or some other way to differentiate experimental iterations from the final thing
  • Writing a text file in the final folder detailing the publication of files (effectively tagging the final version for system searches)
  • Uploading final files immediately to Figshare. This would have given them each a DOI, and that could have been put in the figure legend immediately. Useful.


This all speaks to a bigger issue: reproducibility. In documenting all my analyses I was relying entirely on myself and my ‘good behaviour’. But one bad day, one error, one interruption by a student just as you are writing something, and the record is lost. This is not an issue, however, with workflows that automatically generate reports along with the results and figures. There the generation of a complete and detailed record never fails, as long as the script is working correctly. This is the way I now try to do things.

My final point would be: you don’t have a real backup system until you’ve tested it in the real world. A thought experiment of how you would recover isn’t going to be enough. I thought I was fine, but I nearly wasn’t.

Aug 27 2012

I’ve been reading a lot recently about reproducible research (RR) in bioinformatics, on several blogs and on Google+ and Twitter. The idea is that it is important that someone is easily able to reproduce* your results (and even figures) from your publication using your provided code and data. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research: Reproducible Research in Phylogenetics, or RRphylo.

The current state of affairs

The problem is that although the Methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, replication would be a very time-consuming process: lots of hands-on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looked like the one published, and that’s not really good enough.

Why it matters

I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described in his blog. I have had experience of this too, but that’s another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important, you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or the alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork, this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.

What should be done?

Briefly: pipelines, workflows and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, Biopython), or DendroPy, or Hal. This would go from sequence download, manipulation and alignment (though this step could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step, and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than starting from an alignment file.
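As a sketch of the idea, here is a minimal Python skeleton in which each stage logs its own name and parameters as it runs. The stage names, parameters and stub bodies are all invented for illustration; a real pipeline would call out to alignment and tree-building tools (e.g. via Biopython or DendroPy) inside each stage:

```python
# Skeleton of a script-based phylogenetics workflow: each stage records
# its name and parameters, so the run record is generated alongside the
# results. Stage internals are stubs, not real analyses.

log = []

def stage(name, **params):
    """Decorator that records every stage invocation with its parameters."""
    def wrap(fn):
        def run(data):
            log.append((name, params))
            return fn(data, **params)
        return run
    return wrap

@stage("align", program="muscle", maxiters=16)
def align(seqs, program, maxiters):
    return f"alignment of {len(seqs)} seqs"      # stub

@stage("build_tree", method="ML", model="GTR+G")
def build_tree(alignment, method, model):
    return f"tree from {alignment}"              # stub

tree = build_tree(align(["seq%d" % i for i in range(12)]))
print(tree)
print(log)   # the machine-readable record of what was run
```

Because every parameter passes through the decorator, the log cannot drift out of sync with what was actually executed, which is exactly the property a Methods section lacks.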

Phylogeneticists are sometimes much less comfortable with scripting than are bioinformaticians, but this is something that has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).

There is a certain attraction to GUI approaches, though, and one thing that may well increase is GUI workflow builders. The figure above is from Armadillo, which looks interesting but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.

At the moment a script-based approach linking together analyses is the best approach for RRphylo.

‘So you’ve been doing this, right?’

Err, no, I’ve been bad, really I have. Some of my publications are very poor from the viewpoint of RRphylo. The reasons for that would take too long to go into; I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need and the challenges of implementation. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of an analysis is how good phylogenetics is done, and that is best implemented in an automated way.

Automated Methods sections for manuscripts?

One interesting idea, both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from GenBank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe this itself in a human-readable text format suitable for your manuscript.

The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……

This is a brief, made-up Methods section that doesn’t contain all the detail it could. Parameter values from the script have been inserted in blue. This sort of output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB it is not this Methods text that needs to be distributed; it is the script that carried out these analyses (and produced this human-readable summary as an extra output) that is the real record for distribution.
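A minimal Python sketch of the idea, re-using the made-up values from the example above; the function name and parameter fields are my own invention:

```python
# Sketch of a pipeline writing its own Methods draft from the parameters
# it actually ran with. Values echo the made-up example in the post.
def methods_draft(params):
    return (
        "Sequences were downloaded from the NCBI nucleotide database "
        f"({params['date']}) with the Entrez query {params['query']}. "
        f"These {params['n_seqs']:,} sequences were aligned with "
        f"{params['aligner']} v{params['aligner_version']}."
    )

params = {
    "date": "27 August 2012",
    "query": "cichlidae[ORGN] AND d-loop[TITL] AND 300:1600[SLEN]",
    "n_seqs": 5611,
    "aligner": "MUSCLE",
    "aligner_version": "1.2.3",
}
print(methods_draft(params))
```

The point is that the dictionary is filled by the pipeline itself at run time, so the prose can never disagree with what was actually done.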


This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old Methods section they always have: a general description of what they did that people in the field will understand. I went today and read a lot of Methods sections from phylogeny papers. Some were better than others in describing the important details, but none sounded relevant to the new era of large-scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.

It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.

The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing the easy GUI linking of phylogenetic modules and publication of standard-format workflows to myExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user guides allowing even moderately computer-literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other workflow tools worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing, so I’d be really interested in other suggestions and views on these options.

I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible- of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?

See also 

Our approach to replication in computational science, C Titus Brown

Reproducible Research: A Bioinformatics Case Study Robert Gentleman 2004

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nekrutenko & Taylor, Nature Reviews Genetics 13, 667-672 (September 2012). doi:10.1038/nrg3305 (subscription required)

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goecks et al., Genome Biology 2010, 11:R86. doi:10.1186/gb-2010-11-8-r86 (open access)

Reproducible research with bein, BiocodersHub

Sumatra also talks about pipelines acting as an electronic lab book; here’s a presentation about it.

* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.

Jul 26 2011

I have a project going at the moment to examine changes in intron diversity, size and location in animal genomes. I am always a bit frustrated with the way introns are treated in many genome characterisation papers: “the genome contained Y introns with mean intron size X bp” is usually all we get. This sort of summary stat can hide all manner of interesting trends. One measure that is often useful is intron density, but unfortunately there doesn’t seem to be any standardised way to use this measure. Density is often measured as ‘introns per gene’, which is a reasonable shorthand, although since genes vary in length very considerably, both within and between genomes, it makes quantitative analysis very difficult indeed. I have seen ‘introns per 10 kb’! This is OK, but what number to choose? What if each study chooses a different number? ‘Introns per nucleotide’ will standardise this better, and although the number will be very small, we seem to manage just fine with small mutation rate numbers and the like. But the more I think about it, the less simple this seems.

Introns per bond

Something that is often overlooked when calculating the number of introns per nucleotide is that introns do not insert into nucleotides but rather into the phosphodiester bonds between them. I would suggest therefore that the most accurate and effective way to specify density would be introns per bond. Counting nucleotides seems a convenient shorthand for this, but actually the shorthand leads to small but persistent errors. This is an unfortunate consequence of genome annotation restricting itself to nucleotides while genomic processes sometimes target bonds.

In the cartoon of a gene above, CDS represents the protein-coding region and UTR stands for the 5′ and 3′ untranslated regions. The dashes between nucleotides represent the phosphodiester bonds joining the nucleotides. There are 6 nucleotides in the 5′-UTR, 9 nucleotides in the CDS and 5 nucleotides in the 3′-UTR. What would happen if we were to insert an intron in this sequence at the boundary of one of the gene regions? In which gene region would it be counted?

Coding regions almost always begin with a codon specifying a methionine residue, the start codon ATG. Nothing preceding this A nucleotide is counted as part of the CDS. Coding regions finish with a termination codon, TGA in the example above. This A nucleotide is the end of the CDS. By usual practice, therefore, any intron inserting into the bond between the T and the A at the 5′ end of the CDS would not be counted as part of the CDS, nor would any intron inserting into the bond between the A and the A at the 3′ end of the CDS. This is quite reasonable in many ways, but defining the UTRs by reference to the CDS (after its last nucleotide, before its first nucleotide) means that the CDS has one fewer bond than it has nucleotides, while each UTR has as many bonds as nucleotides! The 5′-UTR here has 6 nucleotides and 6 bonds, the 3′-UTR has 5 nucleotides and 5 bonds, but the CDS has 9 nucleotides and only 8 bonds where an inserting intron would be labelled as a ‘CDS intron’.
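The bookkeeping in the worked example can be written down directly. This Python sketch (function names are mine) reproduces the counts above, with boundary bonds credited to the UTRs:

```python
# Counting insertion sites as bonds rather than nucleotides, using the
# worked example from the post: 5'-UTR 6 nt, CDS 9 nt, 3'-UTR 5 nt.
# With boundary bonds assigned to the UTRs, the CDS has one fewer bond
# than nucleotides, while each UTR has as many bonds as nucleotides.

def bonds_per_region(utr5, cds, utr3):
    """Number of intron-insertable bonds credited to each gene region."""
    return {
        "5UTR": utr5,    # internal bonds + the bond before the start codon
        "CDS": cds - 1,  # internal bonds only
        "3UTR": utr3,    # internal bonds + the bond after the stop codon
    }

def intron_density(introns, bonds):
    return introns / bonds

regions = bonds_per_region(utr5=6, cds=9, utr3=5)
print(regions)                            # {'5UTR': 6, 'CDS': 8, '3UTR': 5}
print(intron_density(2, regions["CDS"]))  # 2 CDS introns -> 0.25 per bond
```

Dividing observed intron counts by bonds rather than nucleotides is the whole correction; the per-region difference is small but systematic, as the next section discusses.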

Does this matter?

Both yes and no. Counts using introns/nucleotide will be very similar to introns/bond. I am not claiming that work needs to be repeated or that substantial errors put previous work into question. But there are two issues here:

  1. We should do it right. Understanding the actual insertion process requires us to use the right language. We should label introns as inserting between nucleotides to avoid confusion. You may not be confused, but try writing a script that counts introns when everything is labelled by nucleotide position.
  2. We can’t yet be sure what difference correct counts make in large data sets. The age of genomics is here. We can study hundreds of thousands of introns from lots of species and treat this mass of data statistically. The numbers of introns/nucleotide and introns/bond may look similar to our eyes, but trusting our savannah ape brains to make the right call is a risky strategy with big numbers.
The lab now uses these ‘per bond’ counts in our genomic intron scripts, which will be released when the first paper is out. I think it would be great if there were a biological standard for intron density; maybe we should even give it a unit. A Gilbert, perhaps, could equal 10⁻³ introns/bond?
This post may be cited using the DOI: http://dx.doi.org/10.6084/m9.figshare.708404 

Oct 29 2008

I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.

I started with the annelids file. The file contains a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences from all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite, but these are bad choices with very large alignments. There are some online resources, but my fasta files are >50-250 MB, so online is not the place even if I could persuade a server to accept my files. I should really have used BioPerl’s SimpleAlign to remove gap columns; it’s probably the most flexible and best able to deal with big files, but I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.
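For what it’s worth, stripping all-gap columns needs nothing beyond a standard library. This Python sketch shows the idea on a toy alignment; a file of hundreds of MB would want a streaming, two-pass version rather than reading everything into memory:

```python
# A stdlib-only sketch of dropping alignment columns that are entirely
# gaps, for FASTA files too large for GUI tools. Reads everything into
# memory; big files would need a streaming two-pass variant.

def read_fasta(text):
    seqs, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = []
        elif name:
            seqs[name].append(line.strip())
    return {n: "".join(parts) for n, parts in seqs.items()}

def drop_gap_columns(seqs, gap="-."):
    """Keep only columns where at least one sequence has a residue."""
    names = list(seqs)
    keep = [i for i in range(len(seqs[names[0]]))
            if any(seqs[n][i] not in gap for n in names)]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}

aln = read_fasta(">a\nA-C--T\n>b\nA-G--T\n")
print(drop_gap_columns(aln))  # {'a': 'ACT', 'b': 'AGT'}
```

Note this only removes columns that are gaps in every sequence; it is not a substitute for the quality-based trimming that Gblocks does.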

I ran it through Gblocks instead, which does more than just remove blank columns, also trimming areas of poor alignment as judged by various criteria. This reduced the file considerably.

I had previously installed FastTree, so I ran it with the command

fasttree -nt annelids.fasta >annelids.tree

It ran quite nicely and produced a viable tree. Something strange happened with the timings, though.

Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END:    2008-10-28 18:23:32
Runtime:     5886         seconds
Runtime:     01:38:06     h:m:s

The text starting with “END:” is the output of my perl script; the lines before that are from FastTree. So FastTree claims to have taken 1577 seconds (26 minutes), but my script times it at 1 hour 38 minutes. I actually noted the time it started, and it did take 1 hour 38 mins. I repeated the run with identical results. A strange discrepancy.

Oct 29 2008

The prediction I made before, about a long silence once this year’s students turned up, was sadly accurate. Anyway: students dealt with, grant proposal submitted, lectures (mostly) given, bureaucracy reduced (a bit), time to get on with some phylogenetics.

I was playing before with FastTree. Although it looks to have been quite well tested by its developers, it’s always worth investigating real-world experiences. Last time I put together a little perl script to time the runs, but then I noticed that it actually reports the number of seconds taken. Hmmm, did I just miss that all along, or did it appear in version 1.0 without my noticing? Oh well, either way I must pay more attention.

Morgan Price (the developer of FastTree) released a version a few days ago that should compile without the malloc error I talked about before.

Sep 26 2008

In order to see how quickly FastTree runs for me, I need some automated method of timing it. While some programs, like PhyML, return a runtime at the end, FastTree doesn’t seem to. So I searched the web and found bits of perl code to put a script timer together.

I have uploaded the relevant script (posixtime.pl) to my repository website. It seems to work well for me, but test it for yourself. Since it is based on POSIX, I think it will only run on *nix systems (like Linux and Mac OS X), although it can perhaps be made to work on Windows too (see here).

It has some placeholder code that prints out stuff and reports back how long it has taken. All the section between # Script goes here # and # Script ends here # can be deleted and replaced with the appropriate commands. So to run FastTree include a line like this:
system("FastTree -nt alignment_file > tree_file");

The system command however is also *nix specific I think. Sorry Microsoft guys I’ve never run perl on a Windows machine. There must be an equivalent way to launch external programs if you are working in e.g. ActivePerl.
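The same timer is only a few lines in Python, for anyone who would rather not use perl. This is a sketch of my own; the command shown is a placeholder `sleep` rather than a real FastTree run:

```python
# Wall-clock timing of an external command with subprocess, a Python
# equivalent of the perl timer. The command below is a placeholder;
# swap in the real FastTree invocation.
import subprocess
import time

def timed_run(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    elapsed = time.perf_counter() - start
    h, rem = divmod(int(elapsed), 3600)
    m, s = divmod(rem, 60)
    print(f"Runtime: {elapsed:.1f} seconds ({h:02d}:{m:02d}:{s:02d} h:m:s)")
    return elapsed

# e.g. timed_run("./FastTree -nt alignment_file > tree_file")
elapsed = timed_run("sleep 1")
```

Like the perl version, `shell=True` makes this *nix-specific as written, though subprocess itself is cross-platform.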

I think the above command, without the ./ prefix, depends on FastTree being installed in the correct location. This is /usr/bin on my system.
Type the command: which perl
You should probably get: /usr/bin/perl
Move to the level above: cd /usr/bin
Copy the FastTree application to here: sudo cp path_to_application ./
Enter password when asked. If you don’t want to type out the path to the application you can (in OSX) just drag and drop the application into the terminal window after you have typed the sudo cp part, and it will paste in the location of the file you have dropped.

You should then be able to launch FastTree by giving the FastTree command wherever you are, without having to cd and move to the directory containing the application.

I have posted the FastTree application I described in the previous post at my file repository site, in case you don’t want to install developer tools and mess around with malloc errors.

Sep 26 2008

This is how I downloaded, compiled and got FastTree working. It’s a bit obvious in places, but I think detailed instructions are a good thing to have out there and Google-findable. I am using a multicore MacPro 2.8 GHz with 4 GB RAM and OSX 10.5.4 (I’m not sure the 8 cores make any difference whatsoever if the code isn’t written to take account of them).

  • I downloaded FastTree from www.microbesonline.org/fasttree.
  • I had to use Safari to do this as Firefox wouldn’t let me right click and download.
  • FastTree 1.0.0 is available as binaries for Windows and Linux. Unfortunately there are no Mac binaries.
  • I downloaded the C code file (156kb)
  • I installed the developer tools “XcodeTools.mpkg” from system software install DVD number 2. It took about 15 minutes. This allowed me to use the gcc compiler to actually make the application.
  • I opened terminal, moved to the location of the c file and issued the command: gcc -lm -O2 -Wall -o FastTree FastTree.c
  • It didn’t like that much and gave the error: FastTree.c:212:20: error: malloc.h: No such file or directory

After some Google searching I came across a couple of indications that malloc.h (I had no idea what this was) was outdated.

malloc.h not supported, use stdlib.h (http://developer.apple.com/technotes/tn2002/tn2071.html)

Most helpful was this:

Mac OS X for unix geeks
5.1.2. malloc.h

make may fail in compiling some types of Unix software if it cannot find malloc.h. Software designed for older Unix systems may expect to find this header file in /usr/include; however, malloc.h is not present in this directory. The set of malloc( ) function prototypes is actually found in stdlib.h. For portability, your programs should include stdlib.h instead of malloc.h. (This is the norm; systems that require you to use malloc.h are the rare exception these days.) GNU autoconf will detect systems that require malloc.h and define the HAVE_MALLOC_H macro. If you do not use GNU autoconf, you will need to detect this case on your own and set the macro accordingly. You can handle such cases with this code:

#ifdef HAVE_MALLOC_H
#include <malloc.h>
#else
#include <stdlib.h>
#endif


  • So I opened the C file in a text editor and searched for malloc.h; I found it on line 212. I then deleted that line (#include <malloc.h>) and inserted the four lines from above.
  • I repeated the gcc command to build the application from above and it worked. No errors and it produced the application in 1 second.
  • I tried to launch it and display the help file using the terminal command: ./FastTree -h
  • It didn’t work at all, so for no logical reason I just dumped the application and rebuilt it with the same gcc command. This time ./FastTree -h did launch the application and it displayed the help file. Success!
  • Using the simple instructions from the FastTree page I tried to run it on some lizard DNA sequences I had lying around. These were fasta sequences, although FastTree claims to work on phylip format too. The command was: ./FastTree -nt lizards.fasta > lizards.tre
  • It produced a tree containing only the first taxon name and nothing else. The input file had Mac line endings, though, and once I converted them to Unix line endings it was fine.
  • I noticed that some of the names were truncated and wondered if they had been chopped at spaces. I replaced the spaces with underscores and got better results. [In the spirit of full disclosure, these last two points were with the previous version of FastTree and I didn’t try to replicate these errors with version 1.0.0.]
  • Another problem I had with the previous version was when I accidentally had a repeated taxon in the matrix. It complained about a “non unique name X in the alignment” and wrote an empty treefile.
  • Having got around these teething problems it ran perfectly, writing a treefile almost instantly (input fasta: N=112 sequences, 923 bp). The tree looked quite sensible and had support values (local bootstraps). These are on a 0 to 1 scale, so 0.95 is 95%.

Fairly straightforward (except for the malloc error) on the whole. Next I’m going to report on some bigger runs and start timing them properly. My immediate goal is to get a tree of all ~20k metazoan 18S rDNA sequences. Of course a tree of that size will bring its own problems, not least how to visualize it.

Sep 26 2008

So when I started writing this blog I thought I would use it to outline some of the things I was working on as I went along. Not real projects, which I will write up and publish, but side projects and how I got them to work (or otherwise). Unfortunately there hasn’t been much of that; instead there have been lots of reviews and comment pieces. Hopefully I am going to change that now.

Last April I posted about FastTree, the rapid NJ application that seems to be able to handle approx 40,000 OTUs. The authors say:
“FastTree inferred a phylogeny for an alignment of 39,092 proteins, including support values, in half an hour on a desktop PC”.

I actually compiled and started using it back in April, but got swamped by teaching and theses and stuff and never posted anything more. My next few posts are going to be about getting FastTree to work and how it copes with some of my datasets. Alternatively, as the new crop of undergraduates arrives next week, there may be a long silence instead.
Jul 07 2008

I’ve just had a manuscript published in BMC Evolutionary Biology, so I thought I’d share a synopsis.
I’ve been interested in root knot nematodes for a while as they are a powerful system for evolutionary genetics and amazingly successful parasites of plants, especially crop plants. Trudgill and Blok (2001) estimate that they “have host ranges that encompass the majority of flowering plants” and that Meloidogyne incognita “is possibly the single most damaging crop pathogen in the world”.
Interestingly, the three most damaging species are obligately asexual mitotic parthenogens (apomicts), which is perhaps counterintuitive given the ongoing arms race between plants and nematodes. Root knot nematodes have previously been suggested to be ancient asexuals, and I’m interested in the consequences of asexuality for the genome, and in the phylogenetic pattern of genetic diversity. In this work I sequenced a number of nuclear protein-coding genes to test the predictions of ancient asexuality (in terms of allelic sequence divergence and phylogenetic clustering of alleles within morphological species). I also sequenced a sperm-specific gene to look for evidence of pseudogene formation and changes in evolutionary rate relative to their sexual (meiotically reproducing) close relatives. This work provides evidence that the signatures of ancient asexuality that were recovered are actually the product of a reasonably recent interspecific hybridization event.

This manuscript is free to download at the BMC site (although the correctly formatted version isn’t available yet).

Genetic tests of ancient asexuality in Root Knot Nematodes reveal recent hybrid origins

David H Lunt

BMC Evolutionary Biology 2008, 8:194
Published: 7 July 2008


The existence of “ancient asexuals”, taxa that have persisted for long periods of evolutionary history without sexual recombination, is both controversial and important for our understanding of the evolution and maintenance of sexual reproduction. A lack of sex has consequences not only for the ecology of the asexual organism but also for its genome. Several genetic signatures are predicted from long-term asexual (apomictic) reproduction including (i) large “allelic” sequence divergence (ii) lack of phylogenetic clustering of “alleles” within morphological species and (iii) decay and loss of genes specific to meiosis and sexual reproduction. These genetic signatures can be difficult to assess since it is difficult to demonstrate the allelic nature of very divergent sequences, divergence levels may be complicated by processes such as inter-specific hybridization, and genes may have secondary roles unrelated to sexual reproduction. Apomictic species of Meloidogyne root knot nematodes have been suggested previously to be ancient asexuals. Their relatives reproduce by meiotic parthenogenesis or facultative sexuality, which in combination with the abundance of nematode genomic sequence data, makes them a powerful system in which to study the consequences of reproductive mode on genomic divergence.

Here, sequences from nuclear protein-coding genes are used to demonstrate that the first two predictions of ancient asexuality are found within the apomictic root knot nematodes. Alleles are more divergent in the apomictic taxa than in those species exhibiting recombination and do not group phylogenetically according to recognized species. In contrast some nuclear alleles, and mtDNA, are almost identical across species. Sequencing of Major Sperm Protein, a gamete-specific gene, from both meiotic and ameiotic species reveals no increase in evolutionary rate nor change in substitution pattern in the apomictic taxa, indicating that the locus has been maintained by selection.

The data strongly suggests the tropical root knot nematode apomicts have a recent origin and are not anciently asexual. The results support that interspecific hybridization has been involved in the origin of this asexual group and has played a role in shaping the patterns of genetic diversity observed. This study suggests that genetic signatures of ancient asexuality must be taken with caution due to the confounding effect of interspecific hybridization, which has long been implicated in the origins of apomictic species.

Trudgill, DL and Blok, VC (2001) Apomictic, polyphagous root-knot nematodes: exceptionally successful and damaging biotrophic root pathogens. Annu Rev Phytopathol 39:53-77