Annotation and Alignment

Summary: Annotation and Alignment is the eighth of 12 lectures on Biological Sequence Analysis [BSA] that I am giving at ANU, Canberra, Australia, from September to December 2018. BSA is a huge field since sequences are presently so abundant.

Sequences can be annotated in a variety of ways: proteins by secondary structure categories, RNA sequences by which ribonucleotide bases pair in the actual physical structure, and genomic DNA by genes, introns, regulatory signals and more.

Computational annotation is typically done by assuming a hidden structure that influences which sequences are likely. For example, genes have a different composition, a 3-periodic structure and more, in contrast to non-coding regions, which have different characteristics.

If multiple homologous versions of the sequence are observed, then the annotation will also influence the nature of the molecular evolution. For genes, for example, mutations that change the amino acid are typically avoided.

The hidden structure is typically described by a Markov Model or a Stochastic Context-Free Grammar, but more advanced models are also occasionally used.
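
As a rough illustration of annotating a sequence through a hidden structure, the sketch below runs the Viterbi algorithm on a toy two-state hidden Markov model. The state names ("coding", "noncoding") and all probabilities are hypothetical values chosen only to make the example work; they are not taken from the lecture, and a realistic gene model would also need the 3-periodic structure mentioned above.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, chosen only for illustration.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
}


def viterbi(seq):
    """Return the most probable hidden state path for seq under the toy HMM."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for a path that is in state s now
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev]
                            + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][symbol]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


annotation = viterbi("ATATATATATGCGCGCGCGC")
```

With these made-up parameters the AT-rich prefix is labelled "noncoding" and the GC-rich suffix "coding"; the annotation is simply the most probable path through the hidden states.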

Insertions and deletions (and thus alignment) pose a serious problem for all these models. For Markov Models the concept of a predecessor is undermined, since the predecessor of a position varies over time as elements are inserted or deleted. Presently, there are no satisfactory solutions to this, but a series of ad-hoc methods do the job well.
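
The predecessor problem can be made concrete with a small sketch. The function name `predecessors` and the two-row alignment are hypothetical, introduced only for illustration: for each alignment column, the function reports the column holding the previous residue in that row, and a single gap is enough to make the rows disagree.

```python
def predecessors(row):
    """For each column of a gapped alignment row, return the column index
    of the nearest residue to its left (None if there is none)."""
    result = []
    last_residue_col = None
    for col, ch in enumerate(row):
        result.append(last_residue_col)
        if ch != "-":          # gap characters do not count as residues
            last_residue_col = col
    return result


# Toy alignment: the second sequence carries an inserted T at column 2.
alignment = ["AC-GT",
             "ACTGT"]
pred = [predecessors(row) for row in alignment]
# At column 3 (the G), the predecessor is column 1 in the first row
# but column 2 in the second row.
```

A Markov model that conditions each column on "the previous position" therefore has no single well-defined predecessor once the homologous sequences differ by an insertion or deletion.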

Preliminary slides can be found here.