E&E PhD Exit Seminar: Mutations, Models and Probability
Prior to the advent of DNA sequencing technology, the progress of population genetics was constrained by the unavailability of molecular-level data to test competing hypotheses and models. Genomic data obtained from DNA sequencing has revealed that evolutionary processes in real populations diverge substantially from the simplifying assumptions of earlier models. I have applied more recent advances in data availability, statistical methods and computational technology to explore the consequences of relaxing the assumptions of earlier models to better reflect what has been learned about populations and the genome. I have focused on two main areas arising from the study of mutation: characterising intragenomic mutational heterogeneity and making inferences about the history of a population from variant data.
In the first instance, I address two primary causes of mutation rate heterogeneity: sequence context and recombination. Mutation rates have been shown to vary not only between different bases and mutation directions, but also with genomic location, at scales ranging from individual nucleotides to multi-megabase regions. The patterns in the heterogeneity of mutation rates at varying scales can be used to improve understanding of the factors influencing mutation and of the relative magnitude of their effects. I used the statistic of variance due to context to compare the effect of contexts of various sizes on the probability of polymorphism. I found that when the 12 point mutation directions are considered separately, variance due to context increases significantly as we move from 3-mer to 5-mer and from 5-mer to 7-mer contexts. However, when all mutations are considered in aggregate, these differences are outweighed by the effect of interaction between the central base and its immediate neighbours. I further show that failure to account for GC-biased gene conversion has led to overestimation of the amount of heterogeneity in the human genome that is not explained by sequence context. I then calculated the variance due to recombination and the probability that a recombination event causes a mutation, employing statistical procedures used in the analysis of time series to take account of the spatial autocorrelation of recombination and mutation rates along the genome. My results support the view that genomic diversity in recombination hotspots arises largely from a direct effect of recombination on mutation rather than predominantly from the effect of selective sweeps.
I also investigated how polymorphism data can be used to make inferences about historical evolutionary influences without the assumptions about the demographic history of a population that have previously been required. I developed a method of testing for selection that can use null models incorporating such a demographic history and that benefits from the power of using the full likelihood of the null model. I compare this method to the well-known statistic Tajima's D and also use it to investigate some regions of the human genome that are candidates for the operation of natural selection. Methods for inferring the genealogical history of a sample from a population have also relied on the assumption that the population has maintained a constant size, or on some related constraint. I present a method of making such inferences without relying on assumptions of this type. This method makes use of Bayesian MCMC techniques with a novel approach to the selection of prior distributions.
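
For readers unfamiliar with the comparison statistic mentioned above, the following is a minimal sketch of the standard Tajima's D calculation. It is not the likelihood-based method presented in the seminar. The function name `tajimas_d` and the input format (a list of equal-length haplotype strings) are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def tajimas_d(haplotypes):
    """Return Tajima's D for a list of equal-length haplotype strings.

    Hypothetical helper for illustration. Returns None when D is
    undefined (no segregating sites, or zero estimated variance).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must all have the same length")

    pairs = n * (n - 1)
    seg_sites = 0   # S: number of segregating sites
    pi = 0.0        # mean number of pairwise differences
    for site in zip(*haplotypes):
        counts = Counter(site)
        if len(counts) < 2:
            continue  # monomorphic site
        seg_sites += 1
        # Probability that two distinct sampled sequences differ at this site.
        pi += 1.0 - sum(c * (c - 1) for c in counts.values()) / pairs

    if seg_sites == 0:
        return None

    # Constants from the standard definition of the statistic.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    return (pi - seg_sites / a1) / sqrt(variance)


if __name__ == "__main__":
    # Toy data: every variant is a singleton, so D is negative.
    print(tajimas_d(["1000", "0100", "0010", "0001"]))
```

An excess of rare variants gives a negative value and an excess of intermediate-frequency variants gives a positive value. Interpreting either as evidence of selection conventionally assumes a constant-size, neutrally evolving population, which is the kind of demographic assumption the work described here aims to relax.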