Lecture 10: Sequence alignment

Why align genetic sequences? 
To discover similarities and relatedness of proteins across and within species, hopefully leading to a better understanding of their roles and mechanisms.

Sequence homology:   Sequences share a common evolutionary ancestor.
  • Ortholog: homologous sequence in a different species (separated by speciation), usually with a similar biological function
  • Paralog: homologous sequence resulting from gene duplication (the duplicate is free to mutate and take on new functions).
(See "Orthologs and paralogs - we need to get it right".)

Homology properties:
  • Usually two homologous proteins have similar 3D structure
  • 3D structure diverges slowly during evolution
  • Usually (but not always) have significant sequence similarity.
  • Large enough sequence similarity usually implies homology

BLAST (Basic Local Alignment Search Tool)
  1. Identify short exact matches
  2. Extend best of these into longer regions of similarity
  3. Optimize

Example 1:  Use BLAST to compare two sequences  (NP_000509 and NP_005323).
  • The gap Existence penalty is the penalty for creating a gap. The Extension penalty is charged for each residue added to the gap.
  • Look at the DOT matrix summary
  • Uses a local alignment scheme 
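A minimal sketch of how the two gap penalties in Example 1 combine, assuming the common affine convention in which a gap of length k costs existence + k * extension. The function name `gap_cost` and the default values 11 and 1 are illustrative assumptions, not settings taken from the example.

```python
def gap_cost(length, existence=11, extension=1):
    """Total penalty for one gap of `length` residues (affine gap model).

    Assumed convention: opening the gap costs `existence` once, and every
    residue in the gap (including the first) costs `extension`.
    """
    if length <= 0:
        return 0
    return existence + extension * length


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, gap_cost(k))
```

Because the existence penalty is paid only once, one long gap is cheaper than several short gaps covering the same number of residues.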
Global alignment (Needleman and Wunsch)
(align a sequence A of length n with a sequence B of length m)
  • Start with a scoring matrix (such as BLOSUM62) and an n+1 by m+1 alignment matrix.
  • Row 0 and column 0 hold cumulative gap penalties (with a gap penalty of -2: 0, -2, -4, ...).
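  • Work through the matrix from the (0,0) corner to the opposite corner, computing each score from its three already-scored neighbors (the diagonal cell plus the substitution score, or either adjacent cell plus a gap penalty).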
  • Backtrack from the final cell (n, m) to recover the best-scoring alignment.
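The global alignment steps can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes a flat match/mismatch score in place of a full scoring matrix such as BLOSUM62, a linear gap penalty of -2, and a traceback that starts from the final cell. The function name and parameters are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # (n+1) x (m+1) alignment matrix; row 0 and column 0 hold cumulative gaps.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    def sub(i, j):
        # Assumed simple scoring; a real run would look up BLOSUM62 here.
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill each cell from its diagonal, upper, and left neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the final cell to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(score)
    print(top)
    print(bottom)
```

When several paths tie, this traceback prefers the diagonal move, so it reports only one of the possibly many optimal alignments.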
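Local alignment (Smith-Waterman): the alignment can start and end anywhere in either sequence
  • Compute cell scores as above, but bound them below by 0 (a negative score is reset to 0)
  • Once you find the highest-scoring cell, backtrack from it until you reach a score of 0 to recover the alignment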
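A minimal sketch of the local alignment procedure, under the same simplifying assumptions as a basic global alignment: a flat match/mismatch score instead of a full scoring matrix, and a linear gap penalty. The function name and default scores are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Same recurrence as global alignment, but never below 0.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Resetting negative scores to 0 is what lets the alignment ignore poorly matching flanking regions and report only the best-matching segment.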
BLAST Algorithm:  Apply heuristics, because you want to search a whole database rather than just compare two sequences
  1. Get a list of possible fixed-length words from query sequence (length 3 for protein alignments, 11 for nucleotide alignment).
  2. Get a list of word pairs that meet a score threshold; for nucleotides, only exact matches are used.
  3. Use an inverted index to look up exact occurrences of these words in the database.
  4. The newer version of BLAST prunes this list by requiring two separate, non-overlapping word pair matches within a distance A of each other.
HSP = high-scoring segment pair.

BLAST scores are assessed by the likelihood that a score at least as high could arise by chance when searching the database.

$P(S \ge x) = 1 - \exp\left(-K m n \, e^{-L x}\right)$      (extreme value distribution)

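m and n are the lengths of the two sequences being compared (for a database search, the query and the database).  K and L are constants that depend on the scoring system and the database and are determined empirically.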

$E = K m n \, e^{-L S}$     expected number of alignments in the database that have a score of S or better.
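A short sketch of how these two quantities relate numerically. The values of K and L used in the demo are placeholders chosen for illustration only, not constants from any real database; the function names are hypothetical.

```python
import math


def expected_hits(S, m, n, K, L):
    """E = K * m * n * exp(-L * S): expected number of chance alignments scoring >= S."""
    return K * m * n * math.exp(-L * S)


def p_value(S, m, n, K, L):
    """P(score >= S) = 1 - exp(-E) under the extreme value distribution."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_hits(S, m, n, K, L))


if __name__ == "__main__":
    K, L = 0.1, 0.3            # placeholder constants (assumed, not empirical)
    m, n = 100, 1_000_000      # query length and database length (made up)
    for S in (40, 60, 80):
        print(S, expected_hits(S, m, n, K, L), p_value(S, m, n, K, L))
```

For small E the two values are nearly equal, since 1 - exp(-E) is approximately E, which is one reason the E value is often reported on its own.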

NCBI blast:
blastn:   nucleotide sequence query -> nucleotide sequence database
blastp:   amino acid sequence query -> protein sequence database
blastx:   all reading frames of nucleotide sequence query -> protein sequence database
tblastn:  protein sequence query -> nucleotide sequence database (all reading frames)
tblastx:   six reading frame translations of nucleotide sequence query -> six reading frame translations of nucleotide sequence database
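To make "six reading frame translations" concrete, here is a minimal sketch that translates a nucleotide sequence in three forward frames and three reverse-complement frames. It assumes the standard genetic code and an unambiguous DNA alphabet (A, C, G, T); codons containing any other letter are shown as X. The function names and frame labels are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame=0):
    """Translate seq starting at offset `frame` (0, 1, or 2); partial codons are dropped."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Map frame label (+1..+3, -1..-3) to the translated protein string."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(seq, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Programs such as blastx, tblastn, and tblastx compare at the protein level, so each nucleotide sequence contributes up to six candidate protein sequences like these.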