Lecture 14: Multiple alignments

Outline:

Resources:
Installation: (If you haven't done so already)
source("http://bioconductor.org/biocLite.R")
require(BiocInstaller)
bioLite()
biocLite("Biostrings")
install.packages("ape", dependencies=TRUE)
install.packages("ade4", dependencies=TRUE)
install.packages("seqinr", dependencies=TRUE)

Sequence alignment in R:
Packages that allow you to perform and manipulate pairwise alignment and multiple-sequence alignments in R:

NCBI interpreting taxonomy reports

Example 1: Look at the Taxonomy, distance tree and multiple alignment reports for the NCBI Blastp for your protein using the accession number for your human isoform a. (For TP53 this accession number is:  NP_000537.3

For explanations of these reports see:
http://www.ncbi.nlm.nih.gov/blast/taxblasthelp.html

Example 2: Are there any bacterial proteins that are similar to these?  (Do another blastp --- here I use new Delta blast).  We get a large number of proteins that are similar in two of the domains. (Look at the taxonomy and organism reports.)  The best score comes from Sorangium cellulosum So0157-2   and has accession number YP_008147703.1

Example 3:  Are there any expressed genes in humans that are more similar than tp53 to the proteins found in Example 2?  Use tblastn with target organism homo sapiens and the EST database. Remember EST (Expressed Sequence Tag) is a short sub-sequence of cDNA --- which is a single strand complementary to mRNA --- so it measures genes that are actually expressed.  Sometimes you can find some new gene relationships that way.

Example 4: Review what the NCBI alignment notation means: Letter indicates a match, + indicates a mismatch but a positive score in the substitution matrix, - indicates a gap.


Multiple alignments

  • Multiple alignment: three or more proteins that are completely or partially aligned.
  • Homologous residues align across sequences
  • Assumed to be derived from a common ancestor
  • Must have the collection of sequences that you want to align in hand.
  • Many different approaches to multiple alignment --- with different results
  • May use features such as--- highly conserved features, conserved motifs, or conserved features that result in understood secondary structures such as alpha helices, and beta sheets.
Uses of multiple alignments:
  • Detect homologs (same function) as opposed to just sequence similarity
  • Reveals conserved motifs or residues
  • Can provide insight into evolution and genetic structure
Approaches:
  • Exact - use multi-dimensional dynamic programming --- time complexity grows exponentially with problem size
  • Progressive (CLUSTW) - start with the two closest alignments and add sequences --- uses pairwise global alignment as the starting point and depends on order added. (* = match, : = close substitution, red = alpha helices
  • Iterative (MUSCLE, PRALINE, INTERALIGN) - use a progressive alignment as a starting point 000 then use dynamic programming. 
  • Consistency (MAFFT, ProbCons, T-Coffee) - like progressive, but decide ordering on the fly.
  • Structure-based (PRALINE)  - use tertiary structure information in alignment.
General comments:  Need at least 25% similarity to work well.  All work well for close alignments.

PFAM:  Database of multiple sequence alignments:  http://pfam.sanger.ac.uk/
  • PFam-A proteins grouped into families
  • PFam-B large uncurated
Interpro: -- integrated multiple alignment tool.  Found at https://www.ebi.ac.uk/interpro/. This tool gives a lot of information about the function of the different pieces of the protein.  Look at the GO terms for your protein.


Alignment of genomic regions:
  • Can also do multiple sequence alignment of genomic regions.
  • Often get regions of close agreement separated by long segments of no agreement.
  • DNA repeats are problem.
  • Chromosomal dislocation can involve millions of base pairs.
  • Usually assume existence of anchors or of an evolutionary tree.
Position-specific iterated BLAST (PSI-blast) -- for finding distantly related proteins:
  • Starts with blastp results.
  • Constructs multiple sequence alignments from resutls
  • Creates a position-specific scoring matrix (PSSM) based on results
  • Estimates significance of results
  • Repeats to improve results
PSSM - is a N x 20 matrix (N is length of query) -- eliminates hits with 94% sequence similarity to get non-redundancy.  Estimates the probability of animo acid substitution and adjusts the scoring matrices --- but must be careful




Manipulating sequences in R (seqinr)

Example 1: Using seqinr -- find out what's in the package
library(seqinr)
lseqinr()

Example 2: Showing the genetic code
           tablecode()

Note:  There are many slightly different genetic codes depending on the species. The following gives the vertebrate mitochondrial code:
           tablecode(numcode=2)

Example 2:  Reading sequences that have been downloaded in as FASTA format.  (Save tp53 in FASTA and then read.)
tp53<-read.fasta(file="tp53sequence.fasta")
tp53HA <- read.fasta("TP53HumanIsoformA.fasta", seqtype="AA")

Example 3:  Finding out information that is available through ACNUC 
choosebank()
gbank <- choosebank(bank="genbank", infobank=T)
ls()

Example 4: Get a virtual query to Hominidae and then get a list of species for this query (PS). Finally retrieve all the species below (SD)
query("hominidae", "SP=Hominidae", virtual = T)
query("hsp", "PS hominidae", virtual = T)
hsp$nelem

query("SDexample", "SD hsp")
getName(SDexample)


Example 5: Queries are case insensitive and individual clauses are enclosed in double quotes

query("dengue", "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"", virtual=T)


Example 6:  Get nuclear DNA.

query("hsCDS", "sp=Homo sapiens AND t=cds AND o=nuclear AND NOT k=partial",
virtual = TRUE))

Query language:
  • Case insensitive
  • Wildcard: @
  • Logical operators: AND, OR, NOT
  • The virtual parameter allows queries to be created without being fetched.