I have a project going at the moment to examine changes in intron diversity, size and location in animal genomes. I am always a bit frustrated with the way introns are treated in many genome characterisation papers- “the genome contained Y introns with mean intron size Xbp” is usually all we get. This sort of summary stat can hide all manner of interesting trends. One measure that is often useful is intron density but unfortunately there doesn’t seem to be any standardised way to use this measure. Density is often measured as ‘introns per gene’, which is a reasonable shorthand, although since genes vary in length very considerable both within and between genomes it makes quantitative analysis very difficult indeed. I have seen ‘introns per 10kb’! This is OK, but what number to choose? What if each study chooses a different number? ‘Introns per nucleotide’ will standardise this better, and although the number will be very small, we seem to manage just fine with small mutation rate numbers and the like. But the more I think about it the less simple this seems.

Introns per bond

Something that is often overlooked when calculating the number of introns per nucleotide is that introns do not insert into nucleotides but rather the phosphodiester bonds between them. I would suggest therefore that the most accurate and effective way to specify density would be introns per bond. It seems reasonable that counting nucleotides is a convenient shorthand for this, but actually this shorthand leads to small but persistent errors. This is an unfortunate consequence of genome annotation restricting itself to nucleotides but genomic processes sometimes targeting bonds.

In the cartoon of a gene above CDS represents the protein coding region and UTR stands for the 5′ and 3′ untranslated regions. The dashes between nucleotides represent phosphodiester bonds joining the nucleotides. There are 6 nucleototides in the 5′-UTR, 9 nucleotides in the CDS and 5 nucleotides in the 3′-UTR. What would happen if we were to insert an intron in this sequence at the boundary of one of the gene regions? In which gene region would it be counted?

Coding regions almost always begin with a codon specifying a methionine residue- the start codon ATG. Nothing preceding this A nucleotide is counted as part of the CDS. Coding regions finish with a termination codon, TGA in the example above. This A nucleotide is the end of the CDS. By usual practice therefore any intron inserting into the bond between the T and the A at the 5′ end of the CDS would not be counted as part of the CDS, nor would any intron inserting into the bond between the A and the A at the 3′ end of the CDS. This is quite reasonable in many ways, but defining UTRs by reference to the CDS (after the last nucleotide, before the first nucleotide) means that the CDS has one less bond per nucleotide than do the UTRs! The 5′-UTR here has 6 nucleotides and 6 bonds, the 3′-UTR has 5 nucleotides and 5 bonds, but the CDS has 9 nucleotides and only 8 bonds where an inserting intron would be labelled as a ‘CDS intron’.

Does this matter?

Both yes and no. Counts using introns/nucleotide will be very similar to introns/bond. I am not claiming that work needs to be repeated or that substantial errors put into question previous work. But there are two issues here

  1. We should do it right. Understanding the actual insertion process requires us to use the right language. We should label introns as inserting between nucleotides to avoid confusion. You may not be confused, but try writing a script that counts introns when everything is labelled by nucleotide position.
  2. We can’t yet be sure what difference correct counts make in large data sets. The age of genomics is here. We can study hundreds of thousands of introns from lots of species and treat this mass of data statistically. The numbers of introns/nucleotide and introns/bond may look similar to our eyes, but trusting our savannah ape brains to make the right call is a risky strategy with big numbers.
The lab now uses these ‘per bond’ counts in our genomic intron scripts, which will be released when the first paper is out. I think it would be great if there was a biological standard for intron density, maybe we should even give it a unit- a Gilbert perhaps could equal one intron/10-3 bonds?
Share
 

I downloaded some datasets from the SILVA96 database. These are structurally aligned SSU rDNA sequences. I browsed through the taxonomic groups and chose annelids (N=1050) and nematodes (N=5048) as smallish tests. I downloaded these as fasta files.

I started with the annelids file. The file contain a LOT of gaps, because it comes from an alignment of hundreds of thousands of sequences of all three domains of life.
I haven’t yet found a good way to process large files to remove columns that are all gaps. It can be done in Clustal and Mesquite but these are bad choices with very large alignments. There are some online resources but my fasta files are >50-250MB, so online is not the place even if I could persuade a server to upload my files. I should really have used BioPerl SimpleAlign to remove gap columns, its probably the most flexible and able to deal with big files, but I was temporarily having trouble installing BioPerl on my desktop (a future post) and ran out of time and patience.

I ran it through Gblocks instead which does more than just remove blank columns, also trimming areas of poor alignment judge by various criteria. This reduced the file considerably.

I had previously installed FastTree, so I ran it with the command

fasttree -nt annelids.fasta >annelids.tree

It ran quite nicely and produced a viable tree.
Something strange with the timings though.

Topology done after 1242.20 sec -- computing support values
Unique: 3137/5048 Bad splits: 37/3134 Hill-climb: 259 Update-best: 11335 NNI: 4149
Top hits: close neighbors 2510/3137 refreshes 176
Time 1577.05 Distances per N*N: by-profile 0.220 (out 0.065) by-leaf 0.291
END:    2008-10-28 18:23:32----------------------------------------------
Runtime:     5886         seconds
Runtime:     01:38:06     h:m:s----------------------------------------------

The text starting with “END:” is the output of my perl script before that from fasttree. So fasttree claims to have taken 1577 seconds (26 minutes) but my script times it at 1 hour 38 minutes. I actually noted the time it started and it did take 1 hour 38 mins. I repeated with identical results. Strange discrepancy.

Share
 

The prediction I made before about a long silence once this year’s students turned up was sadly accurate. Anyway, students dealt with, grant proposal submitted, lectures (mostly) given, bureaucracy reduced (a bit), time to get on with some phylogenetics.

I was playing before with FastTree. Although it looks to have been quite well tested by its developers its always worth investigating real-world experiences. Last time I put together a little perl script to time the runs, but then I noticed that it actually reports the number of seconds taken. Hmmm did I just miss that all along or did it appear in version 1.0 and I didn’t notice. Oh well, either way I must pay more attention.

Morgan Price (the developer of FastTree) has released a version a few days ago that should compile without the malloc error I talked about before.

Share
 

In order to to see how quickly FastTree runs for me I need some automated method of timing it. While some programs like phyML return a runtime at the end FastTree doesn’t seem to. So I searched the web and found bits of perl code to put a script timer together.

I have uploaded the relevant script (posixtime.pl) to my repository website. It seems to work well for me but test it for yourself. Since it is based on POSIX I think it will only run on *nix systems (like Linux and MacOSX) although it can be made to work on Windows too perhaps (see here).

It has some placeholder code that prints out stuff and reports back how long it has taken. All the section between # Script goes here # and # Script ends here # can be deleted and replaced with the appropriate commands. So to run FastTree include a line like this-
system (FastTree -nt alignment_file > tree_file);

The system command however is also *nix specific I think. Sorry Microsoft guys I’ve never run perl on a Windows machine. There must be an equivalent way to launch external programs if you are working in e.g. ActivePerl.

I think the above command without ./ prefix depends on FastTree being installed in the correct location. This is usr/bin on my system.
Type the command: which perl
You should probably get: /usr/bin/perl
Move to the level above: cd /usr/bin
Copy the FastTree application to here: sudo cp path_to_application ./
Enter password when asked. If you don’t want to type out the path to the application you can (in OSX) just drag and drop the application into the terminal window after you have typed the sudo cp part, and it will paste in the location of the file you have dropped.

You should then be able to launch FastTree by giving the FastTree command wherever you are, without having to cd and move to the directory containing the application.


I have posted the FastTree application I described in the previous post at my file repository site, in case you don’t want to install developer tools and mess around with malloc errors.

Share
 

This is how I downloaded, compiled and got FastTree working. Its a bit obvious in places but I think detailed instructions are a good thing to have out there and Google findable. I am using a multicore MacPro 2.8GHz with 4GB RAM and OSX 10.5.4 (I’m not sure the 8 cores make any difference whatsoever if the code isn’t written to take account of them).

  • I downloaded FastTree from www.microbesonline.org/fasttree.
  • I had to use Safari to do this as Firefox wouldn’t let me right click and download.
  • FastTree 1.0.0 is available as binaries for Windows and Linux. Unfortunately there are no Mac binaries
  • I downloaded the C code file (156kb)
  • I installed the developer tools “XcodeTools.mpkg” from system software install DVD number 2. It took about 15 minutes. This allowed me to use the gcc compiler to actually make the application.
  • I opened terminal, moved to the location of the c file and issued the command: gcc -lm -O2 -Wall -o FastTree FastTree.c
  • It didn’t like that much and gave the error: FastTree.c:212:20: error: malloc.h: No such file or directory

After some Google searching I came across a couple of indications that malloc.h (I had no idea what this was) was outdated.

malloc.h not supported, use stdlib.h (http://developer.apple.com/technotes/tn2002/tn2071.html)

Most helpful was this:

Mac OS X for unix geeks
http://unix.compufutura.com/mac/ch05_01.htm
5.1.2. malloc.h

make may fail in compiling some types of Unix software if it cannot find malloc.h. Software designed for older Unix systems may expect to find this header file in /usr/include; however, malloc.h is not present in this directory. The set of malloc( ) function prototypes is actually found in stdlib.h. For portability, your programs should include stdlib.h instead of malloc.h. (This is the norm; systems that require you to use malloc.h are the rare exception these days.) GNU autoconf will detect systems that require malloc.h and define the HAVE_MALLOC_H macro. If you do not use GNU autoconf, you will need to detect this case on your own and set the macro accordingly. You can handle such cases with this code:

#include
#ifdef HAVE_MALLOC_H
#include
#endif

  • So I opened the C file in a text editor and searched for malloc.h, I found it on line 212. I then deleted that line (#include {malloc.h}) and inserted the four lines from above.
  • I repeated the gcc command to build the application from above and it worked. No errors and it produced the application in 1 second.
  • I tried to launch it and display the help file using the terminal command: ./FastTree -h
  • It didn’t work at all, so for no logical reason I just dumped the application and rebuilt it with the same gcc command. This time ./FastTree -h did launch the application and it displayed the help file. Success!
  • Using the simple instructions from the FastTree page I tried to run it on some lizard DNA sequences I had lying around. These were fasta sequences, although it claims to work on phylip also. The command was: ./FastTree -nt lizards.fasta > lizards.tre
  • It produced a tree with only the first taxon name and nothing else. The input file had mac line endings though and when I corrected that to unix it was fine.
  • I noticed that some of the names were truncated and wondered if they had been chopped at spaces. I replaced with underscores and got better results. [In the spirit of full disclosure these last two points were with the previous version of FastTree and I didn't try to replicate these errors with version 1.0.0]
  • Another problem I had with the previous version was when I accidentally had a repeated taxon in the matrix. It complained about a “non unique name X in the alignment” and wrote an empty treefile.
  • Having got around these teething problems it ran perfectly. Almost instantly writing a treefile (input fasta N=112, 923bp). The tree looked quite sensible and it had support values (local bootstraps). These are on a 0 to 1 scale, so 0.95 is 95%

Fairly straight forward (except for the malloc error) on the whole. Next I’m going to report on some bigger runs and start timing them properly. My immediate goal is to get a tree of all ~20k metazoan 18s rDNA sequences. Of course a tree of that size will bring its own problems, how to visualize it.

Share
Sep 262008
 

So when I started writing this blog I thought I would use it to outline some of the things I was working on as I went along. Not real projects, which I will write up and publish, but side projects and how I got them to work (or otherwise). Unfortunately there hasn’t been much of that, lots of reviews and comment instead. Hopefully I am going to change that now. 

Last April I posted about FastTree, the rapid NJ application that seems to be able to handle approx 40,000 OTUs. The authors say 
“FastTree inferred a phylogeny for an alignment of 39,092 proteins, including support values, in half an hour on a desktop PC”.

I actually compiled and started using it back in April, but got swamped by teaching and theses and stuff and never posted anything more. My next few posts are going to be about getting FastTree to work and how it copes with some of my datasets. Alternatively as the new crop of undergraduates arrive next week there may be a long silence instead.
Share
 

I’ve just had a manuscript published at BMC Evolutionary Biology so thought I’d share a synopsis.
I’ve been interested in root knot nematodes for a while as they are a powerful system for evolutionary genetics and amazingly successful parasites of plants- especially crop plants. Trudgill and Blok (2001) estimate that they “have host ranges that encompass the majority of flowering plants” and that Meloidogyne incognita “is possibly the single most damaging crop pathogen in the world”.
Interestingly, the 3 most damaging species are obligatory asexual mitotic parthenogens (apomicts), which is perhaps counter intuitive given the ongoing arms race between plants and nematode. Root Knot Nematodes have previously been suggested to be ancient asexuals, and I’m interested in the consequences of asexuality for the genome, and the phylogenetic pattern of genetic diversity. In this work I sequenced a number of nuclear protein-coding genes to test the predictions of ancient asexuality (in terms of allelic sequence divergence and phylogenetic clustering of alleles within morphological species). I also sequenced a sperm-specific gene to look for evidence of pseudogene formation and changes in evolutionary rate with respect to their sexual (meiotically-reproducing) close relatives. This work provides evidence that the signatures of ancient asexuality that were recovered are actually the product of a reasonably recent interspecific hybridization event.

This manuscript is free to download at the BMC site (although the correctly formatted version isn’t available yet).

Genetic tests of ancient asexuality in Root Knot Nematodes reveal recent hybrid origins

David H Lunt

BMC Evolutionary Biology 2008, 8:194
doi:10.1186/1471-2148-8-194
Published: 7 July 2008

Abstract

Background
The existence of “ancient asexuals”, taxa that have persisted for long periods of evolutionary history without sexual recombination, is both controversial and important for our understanding of the evolution and maintenance of sexual reproduction. A lack of sex has consequences not only for the ecology of the asexual organism but also for its genome. Several genetic signatures are predicted from long-term asexual (apomictic) reproduction including (i) large “allelic” sequence divergence (ii) lack of phylogenetic clustering of “alleles” within morphological species and (iii) decay and loss of genes specific to meiosis and sexual reproduction. These genetic signatures can be difficult to assess since it is difficult to demonstrate the allelic nature of very divergent sequences, divergence levels may be complicated by processes such as inter-specific hybridization, and genes may have secondary roles unrelated to sexual reproduction. Apomictic species of Meloidogyne root knot nematodes have been suggested previously to be ancient asexuals. Their relatives reproduce by meiotic parthenogenesis or facultative sexuality, which in combination with the abundance of nematode genomic sequence data, makes them a powerful system in which to study the consequences of reproductive mode on genomic divergence.

Results
Here, sequences from nuclear protein-coding genes are used to demonstrate that the first two predictions of ancient asexuality are found within the apomictic root knot nematodes. Alleles are more divergent in the apomictic taxa than in those species exhibiting recombination and do not group phylogenetically according to recognized species. In contrast some nuclear alleles, and mtDNA, are almost identical across species. Sequencing of Major Sperm Protein, a gamete-specific gene, from both meiotic and ameiotic species reveals no increase in evolutionary rate nor change in substitution pattern in the apomictic taxa, indicating that the locus has been maintained by selection.

Conclusions
The data strongly suggests the tropical root knot nematode apomicts have a recent origin and are not anciently asexual. The results support that interspecific hybridization has been involved in the origin of this asexual group and has played a role in shaping the patterns of genetic diversity observed. This study suggests that genetic signatures of ancient asexuality must be taken with caution due to the confounding effect of interspecific hybridization, which has long been implicated in the origins of apomictic species.

—-
Trudgill, DL and Blok, VC (2001) Apomictic, polyphagous root-knot nematodes: exceptionally successful and damaging biotrophic root pathogens. Annu Rev Phytopathol 39:53-77

Share
 

The script I referred to in my last post is actually seqConverter.pl written by Olaf Bininda-Emonds, with a few minor modifications to send the output directly to phyml. I thought I would flag up his site which has a large and very useful collection of perl scripts for phylogenetic data wrangling. These are open-source scripts and I frequently find myself using and modifying these programs. Thanks Olaf!

Share
 

I’m not a very competent perl programmer. Even writing the word programmer here makes me slightly embarrassed. I do carry out frequent sequence conversions and manipulations with perl scripts I’ve put together though. Sometimes when I need to run a script many times I’ve found the most irritating thing is launching the scripts and pointing it towards the right input file. A much simpler option in this case is to save the script as an application and drop the files onto it to carry out the conversion. I’ve come across two options for doing this (all this is very Mac-centric I’m afraid but I’d be interested to see MS equivalents in the comments).
The first is the open-source program Platypus by Sveinbjorn Thordarson that “can be used to create native, flawlessly integrated Mac OS X applications from interpreted scripts such as shell scripts or Perl and Python programs”. Make sure that the “is droppable” check box is selected. I found it quite straightforward to turn scripts into droppable applications this way. As it says on the site, but it needs some remembering, you will need to modify your script slightly to accept the infile correctly. The basic tutorial page says the following

Enabling “Is droppable” for an app will modify the property list for for the app in question so that it can receive dropped files in the Dock and Finder. These files are then passed on to the script as arguments via @ARGV. However, the first argument to the script ($ARGV[1], $1 etc., depending on your scripting language of choice) is always the path to the application bundle (for example “/Applications/MyPlatypusApp.app”).

Essentially this means that (in perl at least) where your input file would be identified right at the start by @ARGV[0] it should be changed to @ARGV[1] before creating your application.
Another interesting aspect is the ability to bundle in code files referred to in your script. This means for example that if you have a script that depends on bioperl, it needn’t break, just add in the path to the parts of bioperl needed.

The second option is an AppleScript droppable application. I have to admit that I have never written an AppleScript but I came across this post recently from TUAW outlining the “do script” command. Applescripts can be saved as dropplet applications onto which you drop input files. A bit of Googling reveals people using both do script “script.pl and do shell script “script.pl”. The last seems a bit odd since script.pl is a perl not shell script, but it looks like either will work.
As an example I once created a perl script that took an alignment in a range of formats and converted to a format acceptable to phyml, then ran the program using standard settings of my choice. I have this on my desktop as a droppable application called “runPhyml”. It works very nicely for generating quick trees.

Share
Apr 162008
 

There is a new release (94) of the SILVA database of ribosomal DNA sequences. 23,133 Metazoans, 88,997 Eukaryotes and 606,879 SSU sequences in total.
I’m having a few problems installing ARB on a new machine but need to start exploring this phenomenal phylogenetic resource more closely. One thing I would love when browsing large trees purely for fun is to see pictures at the tips. I wonder if you could automatically grab taxon images, perhaps from Yahoo, maybe in a similar way to Rod Page’s iSpecies? A few problems with this I guess, but it ought to be quite achievable. It would certainly make the trees much more fun to non taxonomists. The “animal tree of life” paper in Nature that I mentioned in a previous post provokes many more positive comments from my colleagues when they see the tree than I have ever seen with other phylogenies. I suspect it is because the authors include nice drawings at the tips. This simple thing might turn out to be very important for accessibility and dissemination of systematic info.

Share

Switch to our mobile site