E&E PhD Exit Seminar: Novel techniques for measuring the effect of neighbouring bases on mutation

Understanding the factors influencing mutation can improve mutation detection techniques, identify diagnostic signatures of disease-causing mutagens, and facilitate the development of more accurate models of genetic divergence. Hypermutability of CpG demonstrates the existence of mutation motifs, sequences of flanking bases that influence point mutation processes. These motifs can thus be indicative of specific mutation mechanisms. For my thesis, I have been developing novel log-linear models for identifying mutation motifs that further allow comparisons of these motifs, and of the complete mutation spectra, between samples. Mutation motifs are visualised using a sequence logo type method. I applied the methods to examine each of the 12 possible point mutations in ~13.6 million human germline mutations (inferred from SNPs recorded in ENSEMBL) and ~181 thousand melanoma mutations from the COSMIC database. I show that all point mutations have significant and distinct mutation motifs. While the major effects of flanking bases lie within 2 bp of the mutated position, we refute previous reports that the effect magnitude decays monotonically with distance. In addition, analyses of malignant melanoma confirmed reported characteristic features of this cancer, and the effects of its neighbouring bases reflect the chemistry of the mutagenic processes that follow exposure to UV light. Therefore, we hypothesise that information regarding the mechanistic origin of point mutations is present in the surrounding DNA sequence, and that the sequence neighbourhood can be used to identify the mechanistic origin of particular mutations. To assess this, I developed a machine learning classifier to discriminate between ENU-induced and spontaneous point mutations in the mouse germline. ENU is a synthetic chemical employed in mutagenesis studies to introduce novel point mutations into genomes.
The classification results reveal that a combination of k-mer size and representation of second-order interactions among nucleotides improved classification performance compared with the naive classifier approach. The results from my work have important implications for modelling context-dependent effects on sequence evolution.