George Tiley

Molecular Evolution and Ecology

Software

Mixture models for detecting whole-genome duplications

I have implemented some mixture models in R that might be useful for detecting ancient whole-genome duplications from genomic or transcriptomic data.

I never made this into a proper R package on CRAN because I do not think the code is that novel; it mostly repackages pre-existing algorithms to implement some slightly different models. The code has no dependencies, so it is easy enough to run with a call to source(). The models can be used to analyze any data; they are not limited to detecting whole-genome duplications.
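For readers unfamiliar with the general idea, here is a minimal sketch of fitting a one-dimensional Gaussian mixture by expectation-maximization and comparing component counts with BIC. It is written in Python and is not the R code described above; the function names (fit_gaussian_mixture, bic), the quantile-based starting values, and the simulated input are all assumptions made for illustration. The simulated values stand in for real data, such as a log-transformed divergence distribution between paralogs, where an extra well-supported component is often read as a candidate duplication peak.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gaussian_mixture(data, k, n_iter=100, min_sd=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, sds, log_likelihood).
    Hypothetical helper for illustration only.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed starting values: means at evenly spaced quantiles,
    # a shared standard deviation, and equal weights.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    grand_mean = sum(data) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in data) / n)
    sds = [max(grand_sd / k, min_sd)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), min_sd)

    log_lik = sum(
        math.log(max(sum(w * normal_pdf(x, m, s)
                         for w, m, s in zip(weights, means, sds)), 1e-300))
        for x in data
    )
    return weights, means, sds, log_lik


def bic(log_lik, k, n):
    """BIC with k means, k standard deviations, and k - 1 free weights."""
    return -2.0 * log_lik + (3 * k - 1) * math.log(n)


# Simulated placeholder data: two overlapping normal components.
rng = random.Random(1)
values = ([rng.gauss(-1.0, 0.3) for _ in range(300)]
          + [rng.gauss(1.0, 0.4) for _ in range(200)])

for k in (1, 2, 3):
    weights, means, sds, log_lik = fit_gaussian_mixture(values, k)
    print(k, round(bic(log_lik, k, len(values)), 1))
```

The lowest BIC marks the preferred number of components. BIC-based selection can overfit heavy-tailed data, so an extra component on its own is weak evidence for a duplication.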

Testing phylogenetic hypotheses of ancient whole-genome duplications

I implemented a simulation-based test for the placement of ancient whole-genome duplications on phylogenies based on summary statistics from reconciled gene trees.

Other models that might be more interesting exist these days, but as far as I am aware, our approach is the best option for large-scale phylogenomic studies because it is fast. It is probably best used as a data exploration method, after which a candidate set of hypotheses can be tested with some of the more rigorous models.

Phasing target enrichment data from polyploids

Collaborators and I have implemented a pipeline for phasing sequence data from polyploids.

This is still under development, but it can already give users analysis-ready output and hopefully avoids a number of bioinformatic headaches for biologists.

Simulations under the multispecies coalescent with introgression

It took me a while to find a way to simulate data that allows for genealogical discordance under the multispecies coalescent (MSC) with introgression. BPP does this very well, and here are some scripts that might be helpful for others. I found BPP to be much more intuitive than ms, but ms or fastsimcoal2 might be more appropriate for simulating under the infinite-sites model for population genomics.