Dec 02 2015

There is now a second tardigrade genome described in a manuscript (Koutsovoulos et al 2015), only days after the first (Boothby et al 2015, not OA). Koutsovoulos et al, however, strongly suggest that the widely publicised rampant horizontal gene transfer (HGT) is problematic. This alternate genome comes from Mark Blaxter’s group in Edinburgh, and has in fact been publicly available since 2014. I was surprised that the first genome paper didn’t compare against it (I’m pretty sure this is true, but I can’t get access to PNAS content while writing from home to double check).
So I’ve learned a few things reading tardigrade papers, and I’ve been very frustrated with Twitter’s 140 characters. I don’t want to try to express complex ideas on Twitter anymore. That’s not what it’s for, and it works very badly when you need a series of numbered tweets to say anything of substance. What I wanted to talk about was publishing genomes rather than genome assemblies.

Preprints are great

Firstly, I’ve been a big fan of preprints for a while. This situation just confirms to me their central place in modern biology. There is a high-profile paper. Another group thinks that it doesn’t match their own analysis of the same organism. Within a week there is a proper manuscript available to read. It’s not peer reviewed, but otherwise it’s a normal manuscript. I can read and judge it for myself, cite it, share it, tweet about it, and write about it in a blog. Preprints are very valuable, and there is no excuse for waiting 6 months for some 20th century paper journal to reluctantly grind out a short note on your concerns. Write what you want to say and get it out there, then work on getting it into a journal if you must.

Late release of data hurts science

Secondly, I have learned that the release of the data behind genome papers can be badly delayed. Twitter was alive with ‘where can I download the Tardigrade genome?’, ‘data isn’t available, how can that be?’ and ‘it’s on GenBank, just waiting for them to release it’. There was no obvious attempt to withhold anything, but data release has to be planned for differently. Data release must be synchronised exactly, to the minute, with the release of the manuscript. Or release the data earlier: you aren’t going to get scooped by releasing the data a day before the manuscript, and that way you can check everything is in place. In their comparative genomics preprint, Koutsovoulos et al clearly needed more access to central parts of the first tardigrade genome assembly:

the UNC raw reads were not yet available. We were also unable to confirm directly expression of the UNC genome HGT candidates because we did not have genome coordinates for the gene predictions.

These are things I would have thought would come out with the paper, and no later, and that’s a shame for science. I’m not implying deliberate obfuscation, just the normal human condition of not getting your shit together in time, something I suffer from too, though I am fighting hard against it because I think it’s really important when publishing.

Linked data and findable data are important

Thirdly, make use of INSDC (i.e. GenBank, EMBL, DDBJ) in addition to any other resources. Koutsovoulos et al had made their data open in early 2014. They had annotated it, provided resources for exploring it, and were improving its quality. But as I understand things it wasn’t on the SRA (correct me please), and although their URL is memorable, maybe it wasn’t found by the UNC team? With hindsight, should they have publicised and linked to their genome more? This is related to Tim Berners-Lee’s fifth star for data sharing, ‘link your data’. That said, I love Badger, the genome resource the second genome has been held in. It’s a really great environment to explore and find stuff out, and I wish it were more commonly used.

The age of genome papers isn’t over

I’ve heard a lot of people repeat ‘the age of single genome papers is over’. It really isn’t. However, it’s clear that comparative genomics is very important, and that shouldn’t surprise anyone, because comparative genomics is just normal biology: we collect data from things, we have a look at what is the same and what is different, we invent ideas and hypotheses, we examine those and try to make sense of stuff. But we can’t do that without single genomes, and people are reluctant to do genomes they don’t get credit for.

I wish that there were a eukaryotic equivalent to Genome Announcements. That might also help to get new genomes out there earlier, rather than waiting for a traditional story to emerge. On the other hand, how is that different from just putting a preprint on bioRxiv with your description of the genome, and getting a DOI? In the long run the papers we usually want to read are comparative genomics papers, and we should encourage, however we can, authors getting credit for releasing their data early, openly and discoverably.

Exciting days are ahead, not just for detailed comparisons of tardigrade genomes but for comparative genomics more broadly. There are lots of surprises and new knowledge out there. Also, I’ll say what a lot of people are thinking: time to reexamine the bdelloid rotifer genome’s ‘HGT’.

Disclosure: I know most of the Edinburgh crowd well and have published with several of them.