Oct 17 2014
 

We are still largely missing the benefits of reproducibility in phylogenetics. I think that this makes our lives unnecessarily difficult and makes us particularly poorly prepared to confront modern data-rich phylogenetics. In this first post, “Why”, I want to talk about why we need reproducible phylogenetics. Then, in part two, “What”, I’m going to talk about some possible approaches to reproducible phylogenetics. In part three, “How”, I’m going to look at some existing software solutions. Lastly, in “ReproPhylo”, I’m going to write about the work my lab is doing to bring these approaches together to create new reproducible phylogenetics solutions.

tl;dr: Reproducibility helps us do better phylogenetics and do it more easily. There are a number of partial solutions out there. We introduce the ReproPhylo framework for easier + better reproducible experimental phylogenetics.

Why do we need reproducible phylogenetics?

Three short answers, then I explain below.

Answer 1: to do better quality science. This is achieved by being able to build on and extend other people’s work. It is also achieved by being able to take an experimental approach to phylogenetic methodologies.

Answer 2: to make your life much easier. The person most likely to reproduce your work is “future you”; making it easy to reproduce, and then modify, your analysis will save you lots of time. It will help you to do more actual research and less reformatting of files and coaxing belligerent applications to read them.

Answer 3: If your work isn’t fully reproducible, is it really science? Sure, it’s nice work that clarifies some important issues, you’re a bright person, and it’s likely correct, but if it’s not reproducible… what the hell are you thinking? Is this why you got into science? To accept stories from other scientists based on them being bright and it sounding right? You are much more sceptical than that about the science you read, so shouldn’t people also be sceptical of your work? Yeah, exactly.

Standing on the shoulders of giants

The ability to extend the work of others, to stand on their ‘shoulders’ [1] and reach higher, is how progress is made. “Wow, if I just added these species to that tree, used their analytical approach, I could actually test [whatever]”. But can you add species and use that approach? Or do you have to start from scratch, collecting sequences from GenBank and trying to reproduce their work before extending it?

How much do you want your work to have impact going forward? Make it easy for people to extend your work and you will be influential.

‘Experimental’ phylogenetics

This refers to approaches where we test the influence of method, parameter choice, and data inclusion on our tree structure. How many studies have you seen where people explore parameter space exhaustively and explicitly compare the phylogenies produced? Not many, I would guess. Any? The reason is that it is too difficult to experiment when manual approaches to phylogenetics are used. Have you ever experimented with alignment parameters in your MSA program of choice? Most phylogeneticists usually only run the defaults; you can check the Methods sections of papers to confirm this. If I have to align sequences with 6 different programs, each with 50 different combinations of parameters, and then compare some characteristics of the 300 alignments and resulting trees, this will be a truly mammoth piece of work. If, however, I have an entirely reproducible pipeline that will iterate over parameter space and produce a detailed report with clear summaries of alignment characteristics and tree variability, then this becomes not an exceptional piece of work but just something I would typically do before getting down to detailed analysis of the question at hand. If a reviewer or critic thinks I have chosen the wrong range of parameters to optimise, they can simply add others and hit RUN on my pipeline to compare to my optimum values. The robustness of science improves.
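The parameter sweep described above can be sketched in a few lines. Everything here is illustrative: the program names, flag spellings, and parameter values are assumptions for the sake of the example, not the settings of any real pipeline.

```python
# Sketch of an 'experimental phylogenetics' parameter sweep (hypothetical
# program names and flags for illustration; substitute your own aligners).
import itertools

def build_runs(programs, gap_opens, gap_extends):
    """Enumerate one alignment job per (program, parameter) combination."""
    runs = []
    for prog, go, ge in itertools.product(programs, gap_opens, gap_extends):
        runs.append({
            "program": prog,
            "gap_open": go,
            "gap_extend": ge,
            # Command line the pipeline would execute for this combination.
            "cmd": f"{prog} --gap-open {go} --gap-extend {ge} "
                   f"-i seqs.fasta -o aln_{prog}_{go}_{ge}.fasta",
        })
    return runs

runs = build_runs(["muscle", "mafft"], [-2.0, -5.0], [-0.5, -1.0])
print(len(runs))  # 2 programs x 2 x 2 parameter values = 8 jobs
```

A real pipeline would execute each `cmd`, compute alignment and tree statistics, and collect them into the summary report; the point is only that enumerating parameter space is trivial once the analysis is scripted.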

Who will actually reproduce my work?

It could be anyone. What if Professor Big loves your work and wants to extend it, great! But the reality is that the person who will certainly need to reproduce and extend your work is Future You! Make life easy for yourself by starting out reproducibly, anyone else calling you a giant and wanting to stand on your shoulders is a bonus.

What makes you think phylogenetics isn’t pretty much OK now?

Long and painful experience makes me think that. Try reproducing three phylogenetics results from papers and your perspective will change. Can you get their data easily, or is it a list of GenBank identifiers in a supplementary Word table that you then have to type into the NCBI website? Can you run their software? If it’s an old paper, can you even find their software? Do you know the dependencies to run it? Do you know what version they ran? Do you know the parameters they used? Are the default parameters now the same as then? Did they exactly record data transformations? Maybe they changed sequence names between the original files and the tree figure. Maybe some taxa were excluded as problematic. Maybe the alignment was manually improved. Maybe some alignment regions were masked. All that is fine, but do you know exactly what they did and how? Did they archive the final tree or only a picture of it? A picture would only allow you to compare by visual inspection to see if you have reproduced a previous study. It is estimated that >60% of phylogenetics studies are ‘lost to science’ [2]. This is a problem.

What is Reproducibility?

I’m not going to cover the semantic differences between reproducibility, replication, repeatability etc. Here I take a practical view of reproducibility as a term used routinely to represent the above terms. I really like this video of Carole Goble explaining the concepts of reproducible research.

Reproducibility is the correct way to do science.

Reproducibility is so integral to what we consider the scientific process that it is hard even to make a counter case here, so I won’t really try. So why isn’t reproducibility the norm? Well, a technically poor form of reproducibility is the norm: the Methods section of the journal article. Later in this series I suggest that technical challenges have prevented complete and efficient reproducibility in the past (it hasn’t been your fault), but now those challenges are pretty much solved (part 3: How) and we should grasp and benefit from these new possibilities.

[1] Standing on the shoulders of giants. Wikipedia. Available: http://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants.

[2] Magee AF, May MR, Moore BR: The Dawn of Open Access to Phylogenetic Data. arXiv [q-bio.PE] 2014. http://arxiv.org/abs/1405.6623

Sep 05 2014
 

I received an email from Alex Lang asking about my current use of Electronic Lab Notebooks.

Hi Dave,

I’m a physics graduate student who just started using a WordPress based ELN. I really found your thoughts on ELNs helpful, especially:
http://www.davelunt.net/evophylo/2009/03/wordpress-as-eln/
https://speakerdeck.com/davelunt/electronic-lab-notebooks-for-ug-students

Since someday I want to be a PI, I had some questions for you. If you would prefer to answer as a blog post, that would be fine by me.

I was wondering if you could elaborate more on the mechanics of how you actually implement ELNs with your students. For example, some questions I had are:

Do students have an ELN on your website? Or do they host their own?
What happens when students leave the group? How do you keep a record of the ELN?
Does the whole lab share ELNs with each other? Or do you restrict it to you seeing students’ ELNs?
Do you ever share parts of the ELN with collaborators / outside people? Does that work out?

Thanks again for the insight into ELNs!

Alex

Hi Alex, thanks for the prompt to write something.

Even though it was quite a while ago that I wrote my posts about ELNs most of it still holds for me. The post you link to above was talking about graduate students and postdocs, whereas the slides refer to undergraduates doing a short project but actually both are implemented in a similar way.

Do students have an ELN on your website? Or do they host their own?
What happens when students leave the group? How do you keep a record of the ELN?

I can answer these two together. I set up a blog ELN for all, and nobody sets up their own. Two reasons for this: Firstly, setting up a blog can be intimidating if you’ve never done it before, as many starting an ELN have not. Secondly, the blog is owned by me and can’t easily be deleted. This ensures I always have the experimental record even if researchers leave. I think this is important. Of course the researcher has admin rights and can always save a copy when they leave and take it with them, but I always retain the ELN record.

I host all ELNs on WordPress.com. The reason is that I trust them more than I trust my personal domain. You will have a reduced list of themes and plugins that can be used, but it is fast and robust.

Does the whole lab share ELNs with each other? Or do you restrict it to you seeing students’ ELNs?

Actually, only the project supervisor (me) usually sees the ELN. This is not really a decision, just how it worked out. I would be happy to let anyone in the lab see, and the people writing the ELN probably wouldn’t mind either, but nobody really wants to read the experimental minutiae apparently. We have regular lab meetings and this provides all the details most people want. Sometimes I will add a postdoc and a PhD student onto each other’s ELN if they are on similar projects, but generally people are happy without seeing details. In some ways I think this is a shame, we can all learn something from seeing how others do science and write up the record. I may try to change that and put everyone on every ELN by default.

Do you ever share parts of the ELN with collaborators / outside people? Does that work out?

I have shared an ELN with another co-I on a grant. It worked well, though it was not a major source of info for them (I don’t really know how often they read it). They preferred meetings where the researcher would summarise and discuss rather than reading the experimental record (which is sometimes a bit dry). Other co-Is have not really wanted to even see the ELN. I, however, quite enjoy browsing new work by my people and I am excited when I get a new post notification!

This brings me to something I hadn’t really thought much about before. Use of ELNs is not primarily a technology issue, it is a personality issue. If you don’t want to read a paper notebook, you won’t want to read an electronic one. Even if you don’t want to read other ELNs you should still keep one yourself:

  • It will help you in writing your manuscript. Some descriptions and methods will already have been written and only require copy/paste. I recently saw an excellent ELN post by my postdoc Amir that was a manuscript draft. Just explaining in the ELN what had been done and what conclusions could be drawn had created that first manuscript version.
  • It is more robust. Your leaking ice bucket cannot ruin the whole year’s experimental record. There is version control. It is backed up in the cloud and if you are wise it has a regularly saved local copy too.
  • The features of a WordPress ELN make it powerful. Search, tags, and categories make my day much easier and more productive.
  • I think it is just easier to keep information this way; easier to paste in text, and screenshots, and protocols, and web links. Even for people working at the bench rather than the computer I think it is easier.

My personal ELN is still very successful, I think. I don’t research every day, which makes the search function vital. I have had some minor failures, but it is the best experimental record I have ever kept. My failures have taught me about the value of records with lots of searchable tags, the importance of explicit data file versions, and never to scribble something to ‘type up later’. The times the ELN gets flaky are when I’m too impatient and do the next thing before really creating a record for the last.

I have 2 students starting later this month, and I will set up a WordPress blog ELN for each. If anyone would like to add their views and experiences please leave comments, or email me.

 

Jan 24 2014
 

FileNotFound

I’m pretty proud of some parts of my workflow: electronic lab notebook, reproducibility, open data files, (semi-obsessive) automated data backups etc etc. But pride often comes before a fall. I had a bad experience this week where I thought I had lost some important phylogenetic data files (I found them eventually), and I’m writing this to work through what I did wrong, and what I need to change in my work routine.

About 4 years ago I built a phylogenetic tree of some cichlid fish. It was a small, relatively simple analysis, just to describe the phylogenetic relationship between some test fish species in a collaborative genomic experiment. The pretty picture of the phylogeny was a manuscript figure, the analysis had been written up in my ELN as I did it, data files were backed up, no need to worry. Time passed, the paper stalled, and then went through a tediously long review, but now I need to submit the files to Dryad and TreeBase ASAP. Fortunately I have the original sequence alignments, the notes on the analysis, and the tree file. Or do I?

This was a few years ago, and the way I do phylogenies has changed quite a bit since then (the subject of another post). My problem was that I had carried out three similar cichlid projects around the same time: three cichlid phylogenies using the same genes to address three different questions. I had many iterations of analysis files for each data set as I worked through analysis parameters and approaches. I had made at least two errors:

Sloppy naming

I had been very unimaginative in my file and folder naming schemes. Lots of things called cich_phylo1 or Bayes_new or cichlid_ND2. Lots of almost identical files in nested folders to search through in order to find the one that had generated the figure. Someone I once worked with had the strategy of creating enormously long filenames detailing the analyses that had generated the file. It was impressive, but I’m not sure I could do it, though it might work well for generated rather than manually created files. Maybe I’ll adopt it for folder names?
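A descriptive-name scheme like that is easy to automate for generated output. This is a minimal sketch with a made-up naming convention; the project, program, and parameter names are hypothetical.

```python
# Sketch: derive a descriptive folder name for generated results from the
# analysis parameters, so the name itself records what produced the contents.
import datetime

def result_dirname(project, program, params, when=None):
    """Build a date-stamped, parameter-stamped folder name."""
    when = when or datetime.date(2014, 1, 24)
    # Sort parameters so the same analysis always yields the same name.
    parts = [project, program] + [f"{k}{v}" for k, v in sorted(params.items())]
    return when.isoformat() + "_" + "_".join(parts)

print(result_dirname("cichlidND2", "mrbayes", {"ngen": 1000000, "chains": 4}))
# 2014-01-24_cichlidND2_mrbayes_chains4_ngen1000000
```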

Full file paths

But surely if I had a good experimental record it would identify the exact file I was dealing with? Yes, it does, though I shamefully often neglected to give the full file path. It shouldn’t have been a problem, but the analysis was actually done in a very unusual location on my HD (a folder shared with a Linux partition) and fell outside my search. The reason I failed to give full file paths was that I was working fast, to a short deadline, and running more than one analysis simultaneously. I was pasting data details, ML parameters, results, and next moves into my lab book, but I was doing it all too fast. If I had an easy way to insert full paths I might have used it, but I didn’t. In OSX you can create a shortcut to copy the full file path, and this is now only a right click away, so I have no excuse for not saving full paths into my ELN.

NB, file modification dates do not always persist

Also, although I had the creation date of the files from my ELN, I wasn’t confident that the dates still persisted. OSX sometimes changes modification dates for no apparent reason. Open a file, read it, close it, and bam! Its folder now has today’s date. Also, some backups I did a few years ago used FTP (by accident), obliterating the file dates. Although the files in question did have the right dates, I had realised early on in my search that I couldn’t rely on that.

What would have helped?

  • Full file paths
  • Unique descriptive names, of folders if not files as well
  • Annotation of ELN blog posts with a “submitted file” tag or some way to differentiate experimental iterations from the final thing.
  • Writing a text file for the final folder detailing the publication of files (effectively tagging the final version for system searches)
  • Uploading final files immediately to FigShare. This would have given them each a DOI, and that could have been put in the figure legend immediately. Useful.
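The manifest idea above (a text file in the final folder, effectively tagging the published versions for system searches) can be automated in a few lines. A minimal sketch, assuming a simple "checksum + absolute path" format; the file names in the usage example are hypothetical.

```python
# Minimal sketch: write a MANIFEST.txt into a 'final' analysis folder,
# recording absolute paths and SHA-256 checksums so the exact published
# files can be identified unambiguously later.
import hashlib
import os

def write_manifest(folder, manifest_name="MANIFEST.txt"):
    lines = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if name == manifest_name:
                continue  # don't checksum the manifest itself
            path = os.path.abspath(os.path.join(root, name))
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            lines.append(f"{digest}  {path}")
    out = os.path.join(folder, manifest_name)
    with open(out, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return out
```

Running `write_manifest("final_cichlid_tree/")` (a hypothetical folder) leaves a searchable, checksummed record of exactly which files were submitted.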

Reproducibility

This all speaks to a bigger issue: reproducibility. In documenting all my analyses I was relying entirely on myself and my ‘good behaviour’. But one bad day, one error, one interruption by a student just as you are writing something, and the record is lost. This is not an issue, however, with workflows that automatically generate reports along with the results and figures. There the generation of a complete and detailed record never fails, as long as the script is working correctly. This is the way I now try to do things.

My final point would be: you don’t have a real backup system until you’ve tested it in the real world. A thought experiment of how you would recover isn’t going to be enough. I thought I was fine, but I nearly wasn’t.

Mar 10 2013
 

I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and had a bit of a complaint about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.

Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.

Hilmar Lapp posted on G+ on the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post here: http://blog.datadryad.org/2011/01/14/journals-implement-data-archiving-policy/ and the policies with links to the journal editorials here http://datadryad.org/pages/jdap

The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.

So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has aspirations to have a life too!

What I would like to see

My thoughts aren’t very novel, others have put forward many of these ideas:

1. A publisher driven version of the Bioinformatics Archive

I would like to see bioinformatics journals taking a lead on this: not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so it can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomedical literature decay with time (PMID: 17238638), and the same is certainly true for links to software. Use of standard code repositories is what we should expect from authors, just as we expect submission of sequence data to a standard repository rather than hosting on the authors’ website.

I think there is great merit to using a GitHub public repository owned by a consortium of publishers and maybe also academic community representatives. Discuss. An advantage of using a version-control platform like GitHub is that it would apply not-too-subtle pressure to host code rather than just the binary.

2. Redundancy to ensure persistence in the worst case scenario

Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository was owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium, unfortunately it doesn’t exist) then authors could not delete the repository. Sure they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up to date code repository could be included in the archived publication snapshot via a wiki page, or README document.

Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although it would be suboptimal to re-invent the wheel for code. I would like to see the FigShare and DataDryad people enter the discussion and offer advice, since they are experts at data archiving.

3. The community to initiate actual action

A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think maybe an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? If anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little if any monetary cost to implementing this. It wouldn’t take long.

But can competing journals really be organised like this? Yes, absolutely, for sure: there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional by comparison to journals not thinking along these lines. Researchers will see a strict archiving policy as a reason to trust publications in those journals as more than just ephemeral vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.

So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d start discussions with the relevant journals ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.


Aug 27 2012
 

I’ve been reading a lot recently about reproducible research (RR) in bioinformatics on several blogs, and on Google+ and Twitter. The idea is that it is important that someone is easily able to reproduce* your results (and even figures) from your publication using your provided code and data. I’ve been thinking that this is a movement that urgently needs to spread to phylogenetics research: Reproducible Research in Phylogenetics, or RRphylo.

The current state of affairs

The problem is that although the Methods sections of phylogenetics papers are typically fairly clear, and probably provide all the information required to pretty much replicate the work, it would be a very time-consuming process: lots of hands-on time, lots of manual data manipulation. Moreover, many of the settings are assumed (defaults) rather than explicitly specified. ‘Have I replicated what they did?’ would then be judged by a qualitative assessment of whether your tree looked like the one published, and that’s not really good enough.

Why it matters

I’m assuming here that what is currently published will allow replication, though in some cases it might not, as Ross Mounce described in his blog. I have had experience of this too, but that’s another long and painful story. So why does it matter? It matters because of the next step in research. If the result of your work is important, you or someone else will probably want to add genes or species to the analysis, vary the parameters of phylogenetic reconstruction, or alignment, or the model of sequence evolution, or reassess the support values, or something else relevant. Unless the process of analysis has been explicitly characterised in a way that allows replication without extensive manual intervention and guesswork, this cannot be achieved. If you want your work to be a foundation for rapid advancement, rather than just a nice observation, then it must be done reproducibly.

What should be done?

Briefly: pipelines, workflows and/or script-based automation. It is quite possible to create a script-based workflow that recreates your analyses in their entirety, perhaps using one of the Open-Bio projects (e.g. BioPerl, BioPython), or DendroPy, or Hal. This would go from sequence download, manipulation, and alignment (though this could be replaced by provision of an alignment file), to phylogenetic analysis, to tree annotation and visualisation. Such a script must, by definition, include all the parameters used at each step, and that is mostly why I prefer that it includes sequence retrieval and manipulation rather than an alignment file.
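One way to structure such a script is as an ordered list of stages, each carrying its own parameters, so a complete provenance log falls out automatically. This is a toy sketch: the stage functions are placeholders (a real version would call Entrez, an aligner, and a tree builder via BioPython or DendroPy), and all names and parameter values here are made up.

```python
# Sketch of a script-based workflow skeleton: each stage records its exact
# parameters so the whole analysis is re-runnable and self-documenting.

def run_pipeline(stages):
    """Execute stages in order, keeping a provenance log of every call."""
    log = []
    data = None
    for name, func, params in stages:
        data = func(data, **params)
        log.append({"stage": name, "params": dict(params)})
    return data, log

# Placeholder stage implementations; real ones would do the actual work.
def fetch(_, query):
    return f"sequences[{query}]"

def align(seqs, gap_open):
    return f"aligned({seqs}, gap_open={gap_open})"

def build_tree(aln, model):
    return f"tree({aln}, model={model})"

result, log = run_pipeline([
    ("fetch", fetch, {"query": "cichlidae[ORGN] AND d-loop[TITL]"}),
    ("align", align, {"gap_open": -5.0}),
    ("tree", build_tree, {"model": "GTR+G"}),
])
print(log[0]["stage"])  # fetch
```

Because every stage is invoked through `run_pipeline`, the log is guaranteed to match what was actually run, which is exactly the property a manual Methods write-up lacks.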

Phylogeneticists are sometimes much less comfortable with scripting than are bioinformaticians, but this is something that has no option but to change. The scale of phylogenomic data now appearing just cannot (and should not) be handled in the GUI packages that have typically enabled tree building (e.g. MEGA, DAMBE, Mesquite).

There is a certain attraction to GUI approaches though, and something we may well see more of is GUI workflow builders. The figure above is from Armadillo, which looks interesting but unfortunately doesn’t seem to be able to save workflows in an accessible form, making it an inappropriate way forward. Galaxy is another good example, able to save standard workflows, but not (yet) well provisioned for phylogenetic analyses.

At the moment a script-based approach linking together analyses is the best approach for RRphylo.

‘So you’ve been doing this, right?’

Err no, I’ve been bad, really I have. Some of my publications are very poor from the point of view of RRphylo. The reasons for that would take too long to go into; I’m not claiming the moral high ground here, but past mistakes do give me an appreciation of both the need and the challenges of implementation. I absolutely see that this is how I will have to do things in future, not just because (I hope) new community standards will demand it, but also because iterating over minor modifications of analysis is how good phylogenetics is done, and that is best implemented in an automated way.

Automated Methods sections for manuscripts?

One interesting idea, that is both convenient and rigorous, is to have the analysis pipeline write the first draft of your Methods section for you. An RRphylo script that fetched sequences from GenBank, aligned them, built a phylogeny, and then annotated a tree figure should be able to describe this itself in a human-readable text format suitable for your manuscript.

The full Methods pipeline is archived at doi:12345678 and is described briefly as follows: sequences were downloaded from NCBI nucleotide database (27 August 2012) with the Entrez query cichlidae[ORGN] AND d-loop[TITL] and 300:1600[SLEN]. These 5,611 sequences were aligned with MUSCLE v1.2.3 with settings……

This is a brief, made-up Methods section that doesn’t contain all the detail it could. Parameter values from the script have been inserted in blue. This sort of output could also fit very well with the MIAPA project (Minimum Information About a Phylogenetic Analysis). NB: it is not this Methods information that needs to be distributed; it is the script that carried out these analyses (and produced this human-readable summary as an extra output) that is the real record for distribution.
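Generating such a draft is mechanically simple once the pipeline records its parameters. A minimal sketch, reusing the made-up values from the example Methods above; the dictionary keys and wording are my own assumptions, not a real MIAPA format.

```python
# Hedged sketch: render a first-draft Methods sentence from the parameters a
# pipeline actually used (all values and version strings are illustrative).

def methods_draft(params):
    """Turn a dict of recorded run parameters into human-readable Methods text."""
    return (
        "Sequences were downloaded from the NCBI nucleotide database "
        f"({params['date']}) with the Entrez query {params['query']}. "
        f"These {params['n_seqs']:,} sequences were aligned with "
        f"{params['aligner']} v{params['aligner_version']} "
        f"(gap open {params['gap_open']})."
    )

draft = methods_draft({
    "date": "27 August 2012",
    "query": "cichlidae[ORGN] AND d-loop[TITL] AND 300:1600[SLEN]",
    "n_seqs": 5611,
    "aligner": "MUSCLE",
    "aligner_version": "1.2.3",
    "gap_open": -5.0,
})
print(draft.startswith("Sequences were downloaded"))  # True
```

Because the text is rendered from the same dictionary the pipeline executed with, the draft can never silently drift from what was actually done.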

Implementation

This won’t be implemented tomorrow, even if everyone immediately agrees with me that it is really important. It is much easier for most people to just write the same old Methods section they always have: a general description of what they did that people in the field will understand. I went today and read a lot of Methods sections from phylogeny papers. Some were better than others in describing the important details, but none sounded relevant to the new era of large-scale analysis. They sounded like historical legacies, which of course is true of scientific paper writing in general.

It will take a community embarrassment to effect a change; an embarrassment that even the best papers in the field are still vague, still passing the methodological burden to the next researcher, still amateur compared to modern bioinformatics papers, still ultimately irreproducible.

The major barrier to RRphylo is the need to write scripts, a skill with which many phylogeneticists are unfamiliar and uncomfortable. This may be helped by Galaxy or the like, allowing the easy GUI linking of phylogenetic modules and publication of standard-format workflows to MyExperiment (I think Galaxy is the future, for reasons I won’t go into here). Alternatively, maybe some cutting-edge labs will put together a lot of scripts and user guides allowing even moderately computer-literate phylogeneticists to piece together a reproducible workflow. Hal and DendroPy seem the places to start at present, and I shall have to try them out as soon as I can. Other workflow tools worth investigating are Ruffus, Sumatra, and Snakemake. At the moment I’ve done a decent amount of Googling and absolutely no testing, so I’d be really interested in other suggestions and views on these options.

I think that Reproducible Research in Phylogenetics is incredibly important. Not just to record exactly what you did, not just to make things easier for the next researcher, but because all science should be fully reproducible- of course it should. I’m coming round to the idea that not implementing RRphylo is equivalent to not releasing your new analysis package and just describing what the program does. But maybe I’m just a lone voice?

See also 

Our approach to replication in computational science, C Titus Brown

Reproducible Research: A Bioinformatics Case Study, Robert Gentleman, 2004

‘Next-generation sequencing data interpretation: enhancing reproducibility and accessibility’, Nekrutenko & Taylor, Nature Reviews Genetics 13, 667–672 (September 2012), doi:10.1038/nrg3305 (subscription required)

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Goecks et al., Genome Biology 2010, 11:R86, doi:10.1186/gb-2010-11-8-r86 (open access)

Reproducible research with bein, BiocodersHub

Sumatra also talks about pipelines acting as an Electronic Lab Book; here’s a presentation about it.

* I am not going to distinguish between replication and reproducibility in this blog post. See here. There are differences, but both are needed.