Dave Lunt

Oct 17 2014

We are still largely missing the benefits of reproducibility in phylogenetics. I think that this makes our lives unnecessarily difficult and leaves us particularly poorly prepared to confront modern data-rich phylogenetics. In this first post, “Why”, I want to talk about why we need reproducible phylogenetics. Then, in part two, “What”, I’m going to talk about some possible approaches to reproducible phylogenetics. In part three, “How”, I’m going to look at some existing software solutions. Lastly, in “ReproPhylo”, I’m going to write about the work my lab is doing to bring these approaches together to create new reproducible phylogenetics solutions.

tl;dr: Reproducibility helps us do better phylogenetics and do it more easily. There are a number of partial solutions out there. We introduce the ReproPhylo framework for easier + better reproducible experimental phylogenetics.

Why do we need reproducible phylogenetics?

Three short answers, then I explain below.

Answer 1: to do better quality science. This is achieved by being able to build on and extend other people’s work. It is also achieved by being able to take an experimental approach to phylogenetic methodologies.

Answer 2: to make your life much easier. The person most likely to reproduce your work is “future you”; making it easy to reproduce, and then modify, your analysis will save you lots of time. It will help you to do more actual research and less reformatting of files and coaxing belligerent applications to read them.

Answer 3: if your work isn’t fully reproducible, is it really science? Sure, it’s nice work that clarifies some important issues, you’re a bright person, and it’s likely correct, but if it’s not reproducible… what the hell are you thinking? Is this why you got into science? To accept stories from other scientists based on them being bright and it sounding right? You are much more sceptical than that about the science you read, so shouldn’t people also be sceptical of your work? Yeah, exactly.

Standing on the shoulders of giants

The ability to extend the work of others, to stand on their ‘shoulders’ [1] and reach higher, is how progress is made. “Wow, if I just added these species to that tree, used their analytical approach, I could actually test [whatever]”. But can you add species and use that approach? Or do you have to start from scratch, collecting sequences from GenBank and trying to reproduce their work before extending it?

How much do you want your work to have impact going forward? Make it easy for people to extend your work and you will be influential.

‘Experimental’ phylogenetics

This refers to approaches where we test the influence of method, parameter choice, and data inclusion on our tree structure. How many studies have you seen where people explore parameter space exhaustively and explicitly compare the phylogenies produced? Not many, I would guess. Any? The reason is that it is too difficult to experiment when manual approaches to phylogenetics are used. Have you ever experimented with alignment parameters in your MSA program of choice? Most phylogeneticists only run the defaults; you can check the Methods sections of papers to confirm this. If I have to align sequences with 6 different programs, each with 50 different combinations of parameters, and then compare some characteristics of the 300 alignments and resulting trees, this is a truly mammoth piece of work. If, however, I have an entirely reproducible pipeline that will iterate over parameter space and produce a detailed report with clear summaries of alignment characteristics and tree variability, then this becomes not an exceptional piece of work but just something I would typically do before getting down to detailed analysis of the question at hand. If a reviewer or critic thinks I have chosen the wrong range of parameters to optimise, they can simply add others and hit RUN on my pipeline to compare to my optimum values. The robustness of science improves.
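The core of such a sweep is tiny. The sketch below is purely illustrative: `run_alignment` is a hypothetical stand-in for shelling out to a real aligner (MAFFT, MUSCLE, PRANK, etc.), and the parameter grid is made up for the example.

```python
import itertools

# Hypothetical aligner wrapper: in a real pipeline this would call out to
# MAFFT, MUSCLE, etc. and return statistics about the resulting alignment.
def run_alignment(program, gap_open, gap_extend):
    return {"program": program, "gap_open": gap_open,
            "gap_extend": gap_extend, "length": None}  # stats filled in by a real run

programs = ["mafft", "muscle", "prank"]
gap_open_values = [1.0, 1.53, 2.0, 3.0]     # illustrative values only
gap_extend_values = [0.0, 0.123]

# Iterate over the entire parameter space; because every combination is
# generated programmatically, the sweep itself documents the methods.
results = [run_alignment(p, go, ge)
           for p, go, ge in itertools.product(
               programs, gap_open_values, gap_extend_values)]

print(len(results))  # 3 programs x 4 x 2 parameter values = 24 runs
```

Adding another parameter value and hitting RUN again is exactly the "reviewer extends the sweep" scenario above: one edit to a list, not a week of manual realignments.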

Who will actually reproduce my work?

It could be anyone. What if Professor Big loves your work and wants to extend it, great! But the reality is that the person who will certainly need to reproduce and extend your work is Future You! Make life easy for yourself by starting out reproducibly, anyone else calling you a giant and wanting to stand on your shoulders is a bonus.

What makes you think phylogenetics isn’t pretty much OK now?

Long and painful experience makes me think that. Try reproducing three phylogenetics results from published papers and your perspective will change. Can you get their data easily, or is it a list of GenBank identifiers in a supplementary Word table that you then have to type into the NCBI website? Can you run their software? If it’s an old paper, can you even find their software? Do you know the dependencies needed to run it? Do you know what version they ran? Do you know the parameters they used? Are the default parameters now the same as then? Did they exactly record data transformations? Maybe they changed sequence names between the original files and the tree figure. Maybe some taxa were excluded as problematic. Maybe the alignment was manually improved. Maybe some alignment regions were masked. All that is fine, but do you know exactly what they did and how? Did they archive the final tree or only a picture of it? A picture would only allow you to compare by visual inspection to see if you have reproduced a previous study. It is estimated that >60% of phylogenetics studies are ‘lost to science’ [2]. This is a problem.

What is Reproducibility?

I’m not going to cover the semantic differences between reproducibility, replication, repeatability etc. Here I take a practical view of reproducibility as a term used routinely to represent the above terms. I really like this video of Carole Goble explaining the concepts of reproducible research.

Reproducibility is the correct way to do science.

Reproducibility is so integral to what we consider the scientific process that it is hard even to make a counter case here, so I won’t really try. So why isn’t reproducibility the norm? Well, a technically poor form of reproducibility is the norm: the Methods section of the journal article. Later in this series I suggest that technical challenges have prevented complete and efficient reproducibility in the past (it hasn’t been your fault), but now those challenges are pretty much solved (part 3: How) and we should grasp and benefit from these new possibilities.

[1] Standing on the shoulders of giants. Wikipedia. Available: http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants.

[2] Magee AF, May MR, Moore BR: The Dawn of Open Access to Phylogenetic Data. arXiv [q-bio.PE] 2014. http://arxiv.org/abs/1405.6623

Sep 05 2014

I received an email from Alex Lang asking about my current use of Electronic Lab Notebooks.

Hi Dave,

I’m a physics graduate student who just started using a WordPress-based ELN. I really found your thoughts on ELNs helpful, especially:

Since someday I want to be a PI, I had some questions for you. If you would prefer to answer as a blog post, that would be fine by me.

I was wondering if you could elaborate more on the mechanics of how you actually implement ELNs with your students. For example, some questions I had are:

Do students have an ELN on your website? Or do they host their own?
What happens when students leave the group? How do you keep a record of the ELN?
Does the whole lab share ELNs with each other? Or do you restrict it to you seeing students ELNs?
Do you ever share parts of the ELN with collaborators / outside people? Does that work out?

Thanks again for the insight into ELNs!


Hi Alex, thanks for the prompt to write something.

Even though it was quite a while ago that I wrote my posts about ELNs, most of it still holds for me. The post you link to above was talking about graduate students and postdocs, whereas the slides refer to undergraduates doing a short project, but actually both are implemented in a similar way.

Do students have an ELN on your website? Or do they host their own?
What happens when students leave the group? How do you keep a record of the ELN?

I can answer these two together. I set up a blog ELN for all, and nobody sets up their own. Two reasons for this: Firstly, setting up a blog can be intimidating if you’ve never done it before, as many starting an ELN have not. Secondly, the blog is owned by me and can’t easily be deleted. This ensures I always have the experimental record even if researchers leave. I think this is important. Of course the researcher has admin rights and can always save a copy when they leave and take it with them, but I always retain the ELN record.

I host all ELNs on WordPress.com. The reason is that I trust them more than I trust my personal domain. You will have a reduced list of themes and plugins that can be used, but it is fast and robust.

Does the whole lab share ELNs with each other? Or do you restrict it to you seeing students ELNs?

Actually, only the project supervisor (me) usually sees the ELN. This is not really a decision, just how it worked out. I would be happy to let anyone in the lab see, and the people writing the ELN probably wouldn’t mind either, but nobody really wants to read the experimental minutiae apparently. We have regular lab meetings and this provides all the details most people want. Sometimes I will add a postdoc and a PhD student onto each other’s ELN if they are on similar projects, but generally people are happy without seeing details. In some ways I think this is a shame, we can all learn something from seeing how others do science and write up the record. I may try to change that and put everyone on every ELN by default.

Do you ever share parts of the ELN with collaborators / outside people? Does that work out?

I have shared an ELN with another co-I on a grant. It worked well, though it was not a major source of info for them (I don’t really know how often they read it). They preferred meetings where the researcher would summarise and discuss rather than reading the experimental record (which is sometimes a bit dry). Other co-Is have not really wanted even to see the ELN. I, however, quite enjoy browsing new work by my people and I am excited when I get a new post notification!

This brings me to something I hadn’t really thought much about before. Use of ELNs is not primarily a technology issue, it is a personality issue. If you don’t want to read a paper notebook, you won’t want to read an electronic one. Even if you don’t want to read other ELNs you should still keep one yourself:

  • It will help you in writing your manuscript. Some descriptions and methods will already have been written and only require copy/paste. I recently saw an excellent ELN post by my postdoc Amir that was a manuscript draft. Just explaining in the ELN what had been done and what conclusions could be drawn had created that first manuscript version.
  • It is more robust. Your leaking ice bucket cannot ruin the whole year’s experimental record. There is version control. It is backed up in the cloud and if you are wise it has a regularly saved local copy too.
  • The features of a WordPress ELN make it powerful. Search, tags, and categories make my day much easier and more productive.
  • I think it is just easier to keep information this way; easier to paste in text, and screenshots, and protocols, and web links. Even for people working at the bench rather than the computer I think it is easier.

My personal ELN is still very successful, I think. I don’t research every day, which makes the search function vital. I have had some minor failures, but it is the best experimental record I have ever kept. My failures have taught me about the value of records with lots of searchable tags, the importance of explicit data file versions, and never to scribble something to ‘type up later’. The times the ELN gets flaky are when I’m too impatient and do the next thing before really creating a record for the last.

I have 2 students starting later this month, and I will set up a WordPress blog ELN for each. If anyone would like to add their views and experiences please leave comments, or email me.


Mar 28 2014

We have two jobs open at the moment in the Hull Evolutionary Genetics group @EvoHull. Both are, I think, quite exciting; not your standard postdoc positions and the group is looking forward to getting two new colleagues.

1-Year Lectureship in Evolutionary Biology

This is maternity cover for Dr Domino Joyce. You will be covering teaching in evolution, ecology, genetics and similar. All the teaching is already prepared, but you can modify and improve it as much as you wish. You will be strongly encouraged to be part of the dynamic EvoHull group, which has regular lab meetings, journal clubs, workshops and the like. It’s a really fun place to work and you could gain great experience, not just in university teaching but also in research, and forge new collaborations. This position could really improve your CV when applying for permanent lectureships! Feel free to discuss the position with Domino Joyce or me. Closing date 10th April 2014. Apply here: https://jobs.hull.ac.uk/Vacancy.aspx?ref=FS0094

2-Year Bioinformatics Research Fellow in Evolutionary and Environmental Genomics

This is an exciting new Research Fellow position for a bioinformatician to work with staff in Evolutionary and Environmental Genomics. We are looking for a bioinformatics colleague and scientist; this is not a technical post. We have quite a number of projects, most already with data, on which you could take the lead. We would additionally welcome the development of new projects in collaboration with staff in the group. We anticipate that for the right candidate this could be a very productive fellowship in terms of publications and collaborations. We know that there are a lot of positions open for bioinformaticians at the moment, but something that sets this opportunity apart is that it’s a fellowship, not a technical position. You will be treated as a colleague, get to choose from a range of projects, build research collaborations, and develop your own interests alongside the core projects. This is a great position for someone who has existing genomic bioinformatics skills, is a first-rate scientist, and likes writing lots of papers. Please feel free to discuss the position with me. Closing date Sunday 24th April 2014. Job advert here, apply here: https://jobs.hull.ac.uk/Vacancy.aspx?ref=FS0093

Other positions

We regularly have postdoc positions to advertise, but if you would like to be pro-active we would love to hear from you. Have a look at the staff on the evohull.org website and get in touch. Several of us have projects that you could adapt to your own tastes. Our department has a great track record of really supporting fellows (several of whom have gone on to permanent positions) so if you would like to apply for an independent fellowship to work here, make contact and we can help you to develop it (and help you through the bureaucracy too).



School of Biological, Biomedical and Environmental Sciences, University of Hull, UK

Hull named in Sunday Times ‘best cities’ list :)

EvoHull group website

Follow @EvoHull on Twitter

Jan 24 2014


I’m pretty proud of some parts of my workflow: electronic lab notebook, reproducibility, open data files, (semi-obsessive) automated data backups etc etc. But pride often comes before a fall. I had a bad experience this week where I thought I had lost some important phylogenetic data files (I found them eventually), and I’m writing this to work through what I did wrong, and what I need to change in my work routine.

About 4 years ago I built a phylogenetic tree of some cichlid fish. It was a small, relatively simple analysis, just to describe the phylogenetic relationships between some test fish species in a collaborative genomic experiment. The pretty picture of the phylogeny was a manuscript figure, the analysis had been written up in my ELN as I did it, data files were backed up, no need to worry. Time passed, the paper stalled and then went through a tediously long review, but now I need to submit the files to Dryad and TreeBase ASAP. Fortunately I have the original sequence alignments, the notes on the analysis, and the tree file. Or do I?

This was a few years ago and the way I do phylogenies has changed quite a bit since then (the subject of another post). My problem was that I had carried out three similar cichlid projects around the same time: three cichlid phylogenies using the same genes to address three different questions. I had many iterations of analysis files for each data set as I worked through analysis parameters and approaches. I had made at least two errors:

Sloppy naming

I had been very unimaginative in my file and folder naming schemes. Lots of things called cich_phylo1 or Bayes_new or cichlid_ND2. Lots of almost identical files in nested folders to search in order to find the one that had generated the figure. Someone I once worked with had the strategy of creating enormously long filenames detailing the analyses that had generated the file. It was impressive but I’m not sure I could do it, though it might work well for generated rather than manually created files. Maybe I’ll adopt it for folder names?

Full file paths

But surely, if I had a good experimental record, it would identify the exact file I was dealing with? Yes, it does, though I shamefully often neglected to give the full file path. It shouldn’t have been a problem, but the analysis was actually done in a very unusual location on my HD (a folder shared with a Linux partition) which fell outside my search. The reason I failed to give full file paths was that I was working fast, to a short deadline, and running more than one analysis simultaneously. I was pasting data details, ML parameters, results, and next moves into my lab book, but I was doing it all too fast. If I had an easy way to insert full paths I might have used it, but I didn’t. In OS X you can create shortcuts to copy the full file path, and this is now only a right click away, so I have no excuse for not saving full paths into my ELN now.
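The same thing is trivial to script. A minimal sketch using only Python's standard library (the filename is hypothetical; any ELN that accepts pasted text would take the output):

```python
from datetime import datetime
from pathlib import Path

def eln_record(path):
    """Return a timestamped line with the unambiguous absolute path,
    ready to paste into an ELN entry."""
    full = Path(path).resolve()          # expand to the full absolute path
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    return f"{stamp}  {full}"

# Example: record whichever file the current analysis step used
print(eln_record("cichlid_ND2.fasta"))
```

Pasting the output of a helper like this, rather than a bare filename, is what would have made the lost-file hunt above a thirty-second search.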

NB, file modification dates do not always persist

Also, although I had the creation date of the files from my ELN, I wasn’t confident that the dates still persisted. OS X sometimes changes modification dates for no apparent reason. Open a file, read it, close it, and bam! Its folder now has today’s date. Also, some backups I did a few years ago used FTP (by accident), obliterating the file dates. Although the files in question did have the right dates, I had realised early on in my search that I couldn’t rely on that.

What would have helped?

  • Full file paths
  • Unique descriptive names, for folders if not for files as well
  • Annotation of ELN blog posts with a “submitted file” tag or some way to differentiate experimental iterations from the final thing.
  • Writing a text file for the final folder detailing the publication of files (effectively tagging the final version for system searches)
  • Uploading final files immediately to Figshare. This would have given them each a DOI, and that could have been put in the figure legend immediately. Useful.
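The "text file for the final folder" idea can be automated so it never gets skipped. This sketch (standard library only; the folder contents and publication note are hypothetical) records each file's absolute path and a SHA-256 checksum, which identifies the exact file even if names or modification dates later change:

```python
import hashlib
from pathlib import Path

def write_manifest(folder, note, out_name="PUBLISHED_FILES.txt"):
    """Write a manifest into `folder` listing the absolute path and
    SHA-256 digest of every file, headed by a note on where the files
    were published. Effectively tags the final version for searches."""
    folder = Path(folder).resolve()
    lines = [note, ""]
    for f in sorted(folder.iterdir()):
        if f.is_file() and f.name != out_name:
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f}")
    out = folder / out_name
    out.write_text("\n".join(lines) + "\n")
    return out
```

Run once in the final analysis folder, e.g. `write_manifest("cichlid_final", "Files submitted to Dryad and TreeBase")`, and a system search for the note text or checksum finds the published version years later.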


This all speaks to a bigger issue: reproducibility. In documenting all my analyses I was relying entirely on myself and my ‘good behaviour’. But one bad day, one error, one interruption by a student just as you are writing something, and the record is lost. This is not an issue, however, with workflows that automatically generate reports along with the results and figures. There the generation of a complete and detailed record never fails, as long as the script is working correctly. This is the way I now try to do things.

My final point would be: you don’t have a real backup system until you’ve tested it in the real world. A thought experiment of how you would recover isn’t going to be enough. I thought I was fine, but I nearly wasn’t.

Jul 17 2013
Godfrey Hewitt

Godfrey Hewitt in 2001 after examining my first ever PhD student

10 JANUARY 1940 – 18 FEBRUARY 2013

I was asked to write a piece about my PhD supervisor Godfrey Hewitt for the UK Genetics Society magazine, and have reproduced a version here. I’d been putting off writing about Godfrey since he died in February, making excuses to myself, so a big thank you to the editor Manuela Marescotti for prompting me to just sit down to type. 

Godfrey Hewitt was an outstanding researcher, mentor, teacher, and professor of evolutionary biology at the University of East Anglia. Godfrey was an excellent geneticist who championed the field and promoted the incorporation of molecular genetics into diverse biological fields throughout his distinguished career. A probably incomplete list of the disciplines in which he applied evolutionary genetics might include speciation, phylogeography, hybridization, phylogenetics, molecular evolution, cytology, ancient DNA, conservation, pest biology, animal domestication, island biogeography, population genetics and molecular ecology. It is hard to overstate how influential Godfrey was in several of these areas. He is very highly cited (making him by some metrics one of the world’s most influential ‘ecologists’), received many awards, and had several conferences organized in his honour. Perhaps more importantly though, along with his many collaborators, he synthesized a change in scientific worldview for those working in the areas of phylogeography, speciation and Quaternary biology.

Born in Worcester, and always proudly associating with the city, Godfrey chose in the late 1950s to become an undergraduate at the University of Birmingham. This decision was made largely due to the department’s expertise in genetics, and Godfrey later carried out his PhD research there with Kenneth Mather, John Jinks and Bernard John. It was genetics that he initially identified as both personally fascinating and of increasing importance in biology, a view that he maintained throughout his career and which would be hard for anyone to argue with today. Something he proudly recognized, and often mentioned, was the academic rigour of genetics-based science compared to some other disciplines in biology, and scientific rigour was an important component of his own research.

Godfrey was far from a one-dimensional character, and conversations with him about almost anything would soon spiral into completely different and fascinating directions. This was sometimes disconcerting, especially for students first meeting him at conferences, as although he was always very friendly you were quickly far from the topics on which you might have rehearsed speaking to the great man. In addition to a broad scientific knowledge, history, geography, human civilizations, current affairs and sport would all be topics for strong, sparky, and often provocative views. In conversation on many science topics I often found myself wondering, slightly bemused, how on earth he knew anything about this specific and obscure area. He often didn’t, but as with all great scientists he could incisively follow the logic (or lack of it) of an argument without prior knowledge. This excellent foundation of logic and scientific rigour was something that he imparted to the very many scientists who passed through his lab. As a PhD student in Godfrey’s group I learned to think like a scientist, and although there are many things I owe him, this is perhaps the most valuable.

Godfrey Hewitt was one of the most intelligent people that I have ever met. This may surprise some who have spoken with him, as his bonhomie and down-to-earth common sense were a million miles away from the quirky boffin-like ‘intelligence’ with which the popular media caricatures outstanding scientists. Godfrey, though, was not for intellectual showmanship; he was for getting things done, and the ability to get truly important and complex problems solved is as good a definition of intelligence as I have found. This was Godfrey’s real talent. He could see the wood for the trees, the wood in all its beautiful complexity, the patterns by which the wood had come to have its position, composition and structure, and the relevance of this for other biological systems. He was interested in, but did not obsess over, small areas of methodological or theoretical advance, preferring instead to collaborate productively with those who were experts. This approach and vision was the basis for many of the significant advances that he synthesized.

Godfrey worked extensively with the journal Molecular Ecology, including a period as senior editor. Colleagues speak of the huge amount of time he freely gave, not only reflected in the handling of prodigious numbers of manuscripts but also in the advice and discussion with authors. His generosity with his time was a central part of his character, and extended outside of journal activities being equally given in person to those who approached him at meetings, came to his lab, or just happened to work in the same building.

Much has been written in tribute about his exceptional mentorship of students and postdocs, for which he won a Nature lifetime achievement award. He gave personal support and scientific mentorship naturally and spontaneously, which is a topic frequently returned to by those who worked with him. It is not contradictory to say that although almost all remember Godfrey fondly he could also be very tough. He did not tolerate foolishness, selfishness, or inactivity, and would be very direct with those who disappointed him. This toughness has left a positive mark, still subconsciously setting the bar very high for many of his students and postdocs, even though their enduring memory may still be his fatherly support. Very many of his former lab members have themselves gone on to academic positions worldwide and his scientific genealogy is truly impressive.

Godfrey died in February 2013 after a stubborn battle with cancer that had lasted for a number of years. He will be remembered by most for his exceptional scientific legacy although this impressive body of work will be eclipsed by his generous humanity for those who knew him.

Dr Dave Lunt, The University of Hull, June 2013

Godfrey Hewitt Wikipedia page

UEA tribute page

Lewis Spurgin’s excellent blog post

Heredity tribute

Molecular Ecology tribute 2013

Godfrey Hewitt — Recipient of 2005 Molecular Ecology Prize

Telegraph newspaper obituary

Mar 16 2013

In today’s Guardian newspaper geneticist Steve Jones has a short column replying to a 7-year-old child who had asked “Will humans evolve into a new species?”. Jones is known in the UK as the media’s favourite geneticist and evolutionary biologist; he is a frequent guest on media shows and contributor in print media. Unfortunately, although very polished, and far from incompetent, he really isn’t very good with the details. He seems to be a self-confident man and often promotes his personal (not very mainstream) views at the expense of what evolutionary geneticists in general think. I don’t like this much, especially when the places he does it are looking for science information as currently understood rather than any one person’s views.

Replying to the 7-year-old today, he first talked about how the speciation process is driven primarily by natural selection (I’m not going to address that in this post, though many would be uncomfortable with that idea too). In the second part of the column he goes on to trot out his view that evolution has stopped for humans. I’m not actually going to pick apart this silly idea, though many others have, but really just to encourage him to publish as soon as possible. I haven’t found any academic paper in which he puts forward this view, though he has been talking about it in the media for approximately 20 years. If this idea were true it would be important, very important, and very interesting. I would love to read that paper. He should gather his evidence and publish it as soon as possible in a peer-reviewed open access scientific journal. Or else shut up.

Some other scientists’ views on Steve Jones’ ideas:

Human evolution stopping? Wrong, wrong, wrong
No Virginia, evolution isn’t ending
Evolution, why it still happens (in pictures)
Steven Jones is being silly
Not the end of evolution again!
Some comments on Steve Jones and human evolution

Mar 10 2013

I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and have complained about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.

Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.

Hilmar Lapp posted on G+ on the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post here: http://blog.datadryad.org/2011/01/14/journals-implement-data-archiving-policy/ and the policies with links to the journal editorials here http://datadryad.org/pages/jdap

The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.

So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has aspirations to have a life too!

What I would like to see

My thoughts aren’t very novel, others have put forward many of these ideas:

1. A publisher driven version of the Bioinformatics Archive

I would like to see bioinformatics journals taking a lead on this: not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at the time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so it can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomedical literature decay with time (PMID 17238638), and the same is certainly true for links to software. Use of standard code repositories is what we should expect from authors, just as we expect submission of sequence data to a standard repository rather than hosting on the authors’ website.

I think there is great merit in using a GitHub public repository owned by a consortium of publishers, and maybe also academic community representatives. Discuss. An advantage of using a version control system like GitHub is that it would apply not-too-subtle pressure to host code rather than just the binary.
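The snapshot mechanism itself already exists in git: a mirror clone copies every branch and tag, independent of the author's original. The sketch below is entirely local, with a throwaway repository standing in for an author's real GitHub URL; a consortium archive would simply clone from that URL instead.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for an author's repository (a real archive would use the
# GitHub URL cited in the manuscript)
git init -q author_repo
cd author_repo
git config user.email author@example.com
git config user.name "Example Author"
echo "print('phylo')" > tool.py
git add tool.py
git commit -qm "version as published"
git tag v1.0-published
cd ..

# The archiving step: a mirror clone preserves all branches and tags,
# and survives even if the author later deletes their copy
git clone -q --mirror author_repo archive_repo.git
cd archive_repo.git
git tag | grep v1.0-published
```

Because the archive copy is a full repository rather than a tarball of a binary, the published tag pins the exact code used in the paper while the author's own repository remains free to move on.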

2. Redundancy to ensure persistence in the worst case scenario

Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository were owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium; unfortunately it doesn’t exist) then authors could not delete the repository. Sure, they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up-to-date code repository could be included in the archived publication snapshot via a wiki page or README document.

Perhaps this archive could be mirrored to Bitbucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although re-inventing the wheel for code would be suboptimal. I would like to see the FigShare and DataDryad teams join the discussion and offer advice, since they are experts at data archiving.
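Mechanically, taking a permanent snapshot like this is trivial with existing tools. Here is a minimal sketch of how a consortium archive might capture a paper’s repository; both URLs, the repository name `phylotool`, and the `BioinformaticsArchive` organisation are hypothetical placeholders, not real locations:

```shell
# Sketch: archiving a paper's code repository in a consortium-owned copy.
# All URLs below are hypothetical placeholders.

# 1. Take a bare mirror of the authors' repository: every branch and tag,
#    not just the current state of the default branch.
git clone --mirror https://github.com/authorlab/phylotool.git phylotool-archive.git

# 2. Push that mirror into a repository owned by the (hypothetical)
#    archiving consortium, which the authors cannot delete or rewrite.
git --git-dir=phylotool-archive.git push --mirror https://github.com/BioinformaticsArchive/phylotool.git
```

Because `--mirror` copies all refs, the archived copy is a complete, independent replica: even if the authors later delete their repository, the consortium copy is untouched.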

3. The community to initiate actual action

A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. An open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? If anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little, if any, monetary cost to implementing this. It wouldn’t take long.

But can competing journals really be organised like this? Yes, absolutely: there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional compared to journals not thinking along these lines. Researchers will see a strict archiving policy as a reason to trust publications in those journals as more than just ephemeral, vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.

So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d start discussions with the relevant journals ASAP. I’m really not sure I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.


Dec 042012

Today I got an email from David E. Schindel, who is the Executive Secretary of the Consortium for the Barcode of Life, announcing Google funding for DNA barcoding. The project aims to create a reference library of COI sequences from endangered species, so that DNA barcoding can be used as a tool against wildlife trafficking. Good for them; this is a good use of money.

However, I was shocked to read later in the email:

DNA barcoding is a technique developed at a Canadian university for identifying species using a short, standardized gene sequence

What? Either this was typed and not checked in a bad moment, or we have entered the world of barcoding political spin. I assume that ‘at a Canadian university’ refers to Guelph, where the Canadian Centre for DNA Barcoding is based, led by Paul Hebert.

The problem is that this Canadian group didn’t invent barcoding: neither the name nor the discipline. I can’t really go into a detailed history of DNA barcoding in this post, but the statement in this email makes me squirm, just like when I hear politicians take credit for natural events or someone else’s work. But the meme is out there; the Consortium for the Barcode of Life begins:

In 2003, researchers at the University of Guelph in Ontario, Canada, proposed ‘DNA barcoding’ as a way to identify species.

I don’t want to deny Paul Hebert’s contribution, nor that of the barcoding organisations. Together they have popularised, formalised, extended and refined DNA barcoding. DNA barcoding is a force for good in the world: they have explained it beautifully to many diverse biologists, gained funding for several large studies, and refined the methodologies. Good for them.

I would like someone unconnected to the international barcoding groups to write a history of the discipline in a broad context, not just of the projects labelling themselves ‘DNA barcoding’. The origins of the methodology and approach probably lie with the bacterial 16S sequencers like Norm Pace. They used short standardised gene segments to identify species, and although some bacterial projects were undoubtedly environmental surveys, assigning taxa to molecular clusters with little extra biological information, many others incorporated well-characterised reference strains, which is exactly what most people would describe as DNA barcoding. Jonathan Eisen has a relevant article (“Barcoding” researchers keep ignoring microbes); make sure to read the comments. The first use of the exact term “DNA barcoding” is unclear to me, and may possibly be in the classic Hebert paper (12614582), although Blaxter used something essentially the same in the title of his 2002 paper “Molecular barcodes for soil nematode identification”, which also employed a short standardised segment of 18S rRNA (11972769). Although some dismiss these sorts of similarity-based groupings as ‘environmental surveys’ like those used for bacteria, Floyd et al. also used a phylogenetic approach to link their environmental sequence clusters (MOTUs) to known, classically described species identified through morphology, with vouchers lodged in museums (see Fig. 4 in Floyd et al. 2002). This is DNA barcoding, and it differs from typical studies only in the reference locus used. Indeed, Ritz and Trudgill had cited Blaxter as talking about a ‘molecular bar-code’ even earlier (Ritz K and Trudgill DL 1999, Plant and Soil 212: 1–11).

Baker and Palumbi (1999) tree identifying whale meat samples by comparison to whale voucher specimen sequences.

So what about mtDNA studies? Well, I haven’t done real research here, I’m just trying to remember stuff, and I would be delighted to hear of examples in the comments. It wouldn’t surprise me at all to find that John Avise’s group (pioneers of mtDNA analysis) had used mtDNA to match unknown samples to voucher specimens. They tended to use whole mtDNA and RFLPs rather than sequencing, though; would that still count? What do you think? Certainly Silberman and Walsh (1364049) were identifying lobster larvae by RFLPs of PCR-amplified rRNA early on; does that count? Alan Wilson’s lab developed some of the first ‘universal’ mtDNA primers used in ecology and evolution (2762322), and again I wouldn’t be surprised to learn that they had assigned unknown specimens to type by DNA barcoding. But they usually chose cytochrome b or 12S rRNA, so would that still count?

A classic DNA barcoding study was published in Science in 1994 (17801528). The authors took ‘whale’ meat samples from Japanese markets and tried to identify which species they really belonged to. This is almost identical to many classic DNA barcoding studies (10.1016/j.foodres.2008.07.005), except that they used a standardised section of the mitochondrial control region rather than COI. I could also mention Hoelzel (2001), “Shark fishing in a fin soup”, who identified the species present in shark fin soup using cytb and NADH2 sequences compared to the database.

So what about COI? Folmer et al. designed some of the earliest (and best) COI universal primers (7881515). These are great primers and still the most commonly used for DNA barcoding. I was unaware of the Folmer primers when I designed my own universal primers (Lunt 1994 PhD thesis) (8799733); several labs were doing this. In Godfrey Hewitt’s lab at UEA we had up to that point been using conserved mtDNA primers from Richard Harrison’s lab at Cornell (they came in pairs named after US presidents and their wives). We weren’t barcoding; the primers were being used for phylogeography, phylogeny and molecular evolution studies. This background just illustrates that COI primers had been around and used widely in all types of evolutionary biology for over a decade before the famous Hebert et al. 2003 paper. So had anyone used DNA sequencing of COI with universal primers to match unknown specimens to described, vouchered species? Had anyone used this approach to discover and describe cryptic species (another important aspect of DNA barcoding)? Definitely, probably lots of people! A study I designed with Africa Gomez and published in 2002 did exactly this (12206243). We had known rotifer isolates characterised by morphology, mating, ecology etc. We had lots of unknown eggs and identified them using a phylogenetic analysis of COI with the standard barcoding primers. Were we the first? Definitely not; we never thought for a minute that we were the first to do this, but I couldn’t tell you who was. Let me just repeat that: we were NOT the first, we did NOT invent DNA barcoding, not even in animals. I just wish people would stop claiming to have ‘invented’ DNA barcoding and instead understand the context in which their work stands. I doubt very much that DNA barcoding in any meaningful sense had a single origin. It was not a moment of inspiration; it was incremental change, as almost all scientific advance is.

If you know any good science journalists, please buy them beers and persuade them to write the history of ‘DNA barcoding’ in the wide sense, and especially of the work of the bacterial 16S pioneers. I’d like to read that.


Dec 042012

I have in front of me a copy of the book “Nucleotide sequences 1984 Part 1: A compilation from the GenBank™ and EMBL data libraries”, published by IRL Press. Wow, what a surreal book for anyone used to dealing with sequence databases today. The idea that DNA sequences would be printed out, in an actual book made of paper, and put on a shelf for people to consult takes some getting used to. To say that it is an idea that has passed is something of an understatement. I bought it for almost nothing as a curio, and it is going to sit proudly on my office shelves. I might even buy Part 2 to go with it.

The sequences range from 1967 to late 1983. The paper is not very white and is slightly absorbent; this isn’t due to age, I think it was just published that way. It weighs 1.55 kg and isn’t a large book. I’ve put a gallery of images below with the book next to a DNA double helix for scale! OK, there is a baseball too; a strange collection of things just came to hand, apparently. Quite a number of sequences are very short (<100 bp) and remind me of second-generation sequencing reads! Despite my incredulity at the start of this post, some of the ideas concerning open access to data referred to in this book’s Introduction are very contemporary. The international sequence databases really have been important torch-bearers for open access to research data over the last few decades.

There are some nice quotes in the Introduction:

While computerized management of the data is needed to provide accuracy, easy maintenance, and electronic access, it is also important to publish the complete database in printed form. This first annual printed compendium effectively makes the entire collection of information available to every member of the scientific community who wishes to use it, including investigators without access to computers.

One of the goals of the collaboration between GenBank and EMBL is continued movement toward common standards and conventions for the two databases.

This compendium, drawn from the American and European databases, is the first printed compilation of substantially all nucleic acid sequences reported between 1967 and late 1983.

As combined in this compendium, the two databases contain a total of nearly three million bases from over 4000 reported sequences.

Yeast and fungal sequences are in the Plant Sequences section

The individual entries within each section are arranged alphabetically by entry name.

The records seem to be closer to EMBL format than GenBank, although Appendix E (which is in Part 2) “illustrates how the format used in the compendium relates to the formats used in the two databases”. The sequences are grouped into mammalian, other vertebrate, invertebrate, plant, and organelle sequence lists. There is also a table of contents, one record per line, giving the length of each sequence and the page it is on.
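For readers who have never seen the two flat-file styles side by side, here is a minimal, entirely made-up record sketched in each style (the entry name, accession and sequence are invented; real records carry many more fields). EMBL entries use two-letter line codes, GenBank entries use keyword columns:

```text
EMBL style:

ID   EXAMPLE01; SV 1; linear; DNA; 24 BP.
AC   X99999;
DE   Example entry (hypothetical).
SQ   Sequence 24 BP;
     gatcgatcga tcgatcgatc gatc                                        24
//

GenBank style:

LOCUS       EXAMPLE01               24 bp    DNA     linear
ACCESSION   X99999
DEFINITION  Example entry (hypothetical).
ORIGIN
        1 gatcgatcga tcgatcgatc gatc
//
```

Both styles terminate a record with `//`, which is why the compendium’s entries could be mapped between the two formats at all.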

The first sequence in the entire book is “APE (CHIMPANZEE) ALU TYPE DNA. ACCESSION NUMBERS: J00322” and the last is “YEAST (S. CEREVISIAE) MITOCHONDRIAL VAR1 GENE 3′ FLANK. ACCESSION NUMBERS: K00385”.

Google Books seems to have scanned the entirety of both volumes, but I couldn’t get it to work for me. What a fantastic book.

Nov 172012

Mendeley, Zotero, Papers, CiteULike, and others are all playing the social reference management game. You store your PDFs in their excellent programs and then you can start to be social: form groups, browse subject categories, subscribe to other people’s reference lists. My question is this: is the current implementation the best way forward for us, the users, or is it driven by the interests and/or old-fashioned thinking of the providers? For those of you just looking for the punchline, I think the answer is the latter: best for them, not us. We should be learning lessons from social networks and the open-science movement, and adopting open standards, before some Facebook-like behemoth emerges and dominates social scientific reference management for the foreseeable future.

Social Reference Management

Reading scientific literature is at the heart of all science, and whoever you are there is always a little paranoia that you haven’t read the latest important thing in your area. The social web has a lot to contribute here. When I find an interesting reference I click to add it to my (say) Mendeley library, and subsequently post it to a shared group folder such as “Gene Duplication” or “Stuff Dave is Reading”. Other people can then follow that collection from within Mendeley, but outside the Mendeley ecosystem it becomes much more difficult. I’m not picking on Mendeley here; they have done a lot of hard work in creating some excellent software, and they are a commercial company after all. But wouldn’t it be better for me, and for science, if I could subscribe to lots of people on lots of networks without barriers?

Social Networks

It’s almost impossible to consider this topic without introducing analogies to social networks like Google+, Facebook, and Twitter. In these arenas there has been a lot of talk about privacy, data ownership, and freedom of connectivity. If I am on Facebook I can interact with other Facebook users just fine, but how can I bring in someone on a different social network? I can’t; I’m locked in, able to operate only within the environment designed by Facebook. Similarly there are issues of data ownership: Facebook has famously caused some concern regarding who owns the photos that are uploaded. But there is another side to data ownership: not ownership of the data itself but of the data’s organisation. I might be reluctant to leave a social network, not because I can’t keep a list of friends, but because I have dozens of people partitioned into different groups, organised so that I can follow work-related contacts in a different way from drinking buddies.

Open, Distributed, Semantic Social Networks

The web discussion of social networks has led in one very promising direction: away from locked-in “data silos” towards open standards, where different social networks speak the same language and you can communicate across networks. In addition, there are far fewer concerns about what rights to your content you are giving away if that content stays on your server, or otherwise in your control. Similarly, if we rely on the export tools provided by a service provider, they are always likely to be poor. Why would a provider invest time developing tools that make it easier to leave their network? Sure, you have to have an export option, otherwise you will get bad press, but the rabble won’t rise up across the web just because your export is mediocre and loses much of the organisational content people have invested in. And it’s not like people can fix this themselves, because you control the software environment. There are many open distributed social networks, including Diaspora, OneSocialWeb, gnusocial and Friendica, although they have struggled to make significant inroads in terms of user base and some are no longer in active development. What if we used their code for open scientific bibliographic social networking?

Open Social Reference Management is important

I would like an open, standard, social system of reference sharing. This could be done within the providers’ current systems; it might not even look that different from the present, and it could certainly be built to match the look and feel of whatever reference software you are in. But the important difference is that you could follow a group, or recommend reading to anyone, no matter what system they were using.

Yes, this would be difficult, but not very difficult. The open licences of some of the existing social software could be challenging for commercial ventures to incorporate (though perhaps not for Zotero, which is open source). Linking the reference to the PDF could again be challenging, especially if your business model is based on selling user storage. There are many other things that could also be challenging, but hey, really big advances are always difficult. I currently find a lot of references on Twitter and G+ but they are disconnected from my library of literature; sure, they are social, but nothing more.

Hyperbole alert: Science is built on knowledge. Reading that paper which sparks the idea, or makes the link, that eventually produces (name of the coolest advance you can think of). The group that builds the infrastructure that truly links scientists’ reading and social knowledge-sharing across the world will save lives, protect biodiversity, and build rockets to Mars. What could be cooler, or more important, than that? And all we need is Diaspora* for journal articles, to link Mendeley to Zotero to CiteULike to Papers; how hard is that really?


This is worth reading: “A flock of twitters: decentralized semantic microblogging”