Comparative Sequence Analysis

This lecture series has 12 audio-lectures (slides with voice), each with 30 slides and each lasting 90 minutes.  Thus 18 hours in total.  The aim is to give the BIG PICTURE of the topic and hopefully also hint at open problems.  Each slide might summarize a paper, so technical detail is generally omitted, but references should be one each slide.   The Lectures were first given as ordinary 50 minute lectures with 15-20 slides each at Australian National University, Canberra.

Optimization Alignment - Pairwise.   I will in this first lecture, go through basic principles where you chose a “best” alignment using dynamical programming after providing an optimization function weighing changes of amino acid/nucleotide and insertion-deletions of small segments.  There are a variety of ways to accelerate the basic algorithm and to choose parameters.

Optimization Alignment – Multiple: Exact Algorithms.  In this lecture, I will discuss exact algorithms to align more than two sequences.  An algorithm for aligning n sequences related by a phylogeny was first published in 1973 by David Sankoff when indels had length 1.  This has since been generalized to longer insertion-deletions with a gap-cost g_k=ab*k. Accellerations to the basic algorithm has been proposed.  Variants of parametric alignment proposed by Kececioglu finding an alignment that has properties as close to a set of bench marks are discussed.

Optimization Alignment – Multiple: Heuristics.   I will discuss heuristic methods used to align a large number of sequences.  These are the programs that the users know.  There is no guarantee for their performance.  They are empirically tested against bench marks that is has been manually made to fit what a biologist would like often using extra data such as protein structure.  Some of these programs (like Clustal, MAFFT, PRANK,.) have developed over decades and articles describing them often have focus on how to run the program.  I will in this lecture try to list key tricks used in these programs allowing them to analyze such large data sets, classify the objectives of multiple alignment, go through some key programs and finally suggest some problems that might be worth to work on.

Alignment Variants: HMM, Profile, Local, Database Search, Alignment Free.   So far I have covered pairwise and multiple alignment optimizing some score function or defined via a heuristic algorithm. In this lecture I will cover the use of HMMs (Profile, Pairwise, Phylogenetics) as invented in the 90s.   Then, I will discuss sequence comparison, that does not need an alignment (alignment free methods).  A finish with local alignment, probability theory of sequence comparison and database search.

Indel Processes and the Basic Model (TKF91).    This and the next 2 lectures deals with stochastic models of insertion-deletions, which is an alternative to score based alignment methods.  This approach should appeal to statisticians.   This approach is more consistent approach to analysis of homologous sequences over the use optimization alignment, that uses optimization to deal with insertion-deletions, but statistics to deal with substitution events.  Statistical alignment gives a distribution on alignments and parameter estimation of all evolutionary parameters including insertion-deletion rates.

  The first statistical alignment paper was published in 1986 by Bishop and Thompson, but it was overshadowed by the model (TKF91) published in 1991 by Thorne, Kishino and Felsenstein that provided a more consistent model and thorough model. Since there there has been very few models compared to for instance models of amino acids or nucleotides.  In 1992 Thorne, Kishino and Felsenstein  published a model that allowed longer insertion-deletions under some very unbiological constraints. In 2004 Miklos, Lunter and Holmes published a more flexible model of long insertion-deletion. 

 In this lecture I will focus on the basic models and how to formulate a dynamical programming algorithm that can calculate the probability of 2 homologous sequences and define a distribution on all possible alignments.

 However, there only little effort goes into development of statistical alignment compared to other approaches.


Statistical Alignment – Long Indel and Irreversibility.   This and previous and next lecture deals with stochastic models of insertion-deletions, which is an alternative to score based alignment methods.   In this lecture I will focus on how to incorporate long insertion deletions in a stochastic model of sequence evolution.  There are very few papers on this and it is a difficult topic. Three papers are Miklos, Lunter and Holmes (2004) plus Karin, Ashkenazy, Hein and Pupko (2018) and hopefully Sanders, Golden and Hein (2019). MLH04 introduced a time reversible model of insertion-deletions and a dynamical programming algorithm for calculating the probability of a pair of sequences.  Several conceptual novelties had to be introduced to solve these problems: i. ChopZones [CZ] which is the analogue of a column in optimization alignment or p-functions in TKF91;ii. embedding the actual sequence in an imaginary infinitely long sequence, defining deletions as the time mirror-image of insertions to maintain time reversibility; iii. an effective way of summing over possible evolutionary paths in a CZ and more.  KAHP18 introduced a Knordian knot solution to path-integration and investigated how the maximal probability alignment compares to that of MAFFT and HOMSTRAD.  SGH19 relaxes the irreversibility assumption which can have major advantages.

Statistical Alignment - Multiple.   It might seem that going from pairwise statistical alignment to multiple phylogenetic statistical alignment would be an easy step, given that  the step from pairwise optimization alignment to multiple phylogenetic optimization l alignment was solved in 1973 only 3 years after that pairwise optimization alignment had been proposed.  However, this step is considerably harder and was first solved in 2001 by Mike Steel – 10 years after the TKF91 model.

 In this lecture, I will go through the basic dynamical programming algorithm and how it extends to other models that TKF91.  This algorithm only works for a few sequences and beyond that computational statistics methods must be used.

Alignment and Annotation.  Sequences can be annotated in a variety of ways.  Proteins by secondary structure categories, RNA sequences by which pairs of ribonucleotides bases pair in the actual physical structure, genomic DNA by genes, introns, regulatory signals and more.

  Computational annotation is typically done by assuming a hidden structure that influences which sequences are likely/  Like genes have different composition, a 3-periodic structure and more in contrast to non-coding regions have different characteristics.  The hidden structures mentioned in this lecture are GENES, REGULATORY SIGNALS, PROTEIN SECONDARY STRUCTURE, RNA SECONDARY STRUCTURE. Additionally, CG-Islands and Pedigrees is mentioned briefly.

  If multiple homologous versions are observed by the sequence, then the annotation will also influence the nature of the molecular evolution like for genes there typically will be avoidance of amino acid changing mutations. 

  The hidden structure is typically described by a Markov Model or a Stochastic Context-Free Grammar, but also more complex model are occasionally used.

 Insertion-deletions (and thus alignment) pose a serious problem for all these models.  For Markov Models the concept of predecessor is undermined since this varies over time.   Presently, there is not satisfactory solutions to this, but a series of ad-hoc methods do the job well.

Alignment beyond Sequences: Multigene Families.   In this lecture we discuss comparison of overall genome structure and the evolution of multigene families – both essential to genome analysis. Genome Structure Multigene Familes

Alignment beyond Sequences and Genomes.   In this lecture we discuss a long list of structures/quantities that can be treated by evolutionary comparison:  Gene Frequencies in Populations, Patterns, Manuscripts, RNA Structure, Protein Structure, Language, Shape, Networks and Expression Levels.  More could be added. 

Summary and Open Problems.  In this lecture I try to summarize the previous 11 lectures and sketch what I view as open and exciting problems.   There are plenty.