Lecture 15: Multiple alignments - continued

  • Some basic exploration using Interpro and PFAM
  • Some review of multiple sequence alignment from last time
  • Some information about R classes (which will be helpful in dealing with some of the packages we are using)
  • Accessing multiple alignments in R using Biostrings
Some resources:

Two news articles of interest:

PFAM:  Database of multiple sequence alignments:  http://pfam.sanger.ac.uk/
  • PFam-A proteins grouped into families
  • PFam-B large uncurated
Interpro: -- integrated multiple alignment tool.  Found at https://www.ebi.ac.uk/interpro/. This tool gives a lot of information about the function of the different pieces of the protein.  Look at the GO terms for your protein.

Some notes about R objects and functions

Generic functions
  • Encapsulates a generic concept like summary print.
  • Has a dispatch system which figures out actual method to call.
  • The dispatcher can get confused and call the wrong function.
  • Use methods function to get a list:   methods("summary)

S3 classes
  • Have attributes and methods  (for example data.frame) 
  • Can get the methods of a class with methods:   methods(class="data.frame")
  • Can overload generic methods for a class that you define:    print.data.frame
  • R doesn't have any validation for SE objects.
  • Does not follow OO principles
S4 classes  (See http://www.r-project.org/conferences/useR-2004/Keynotes/Leisch.pdf for details)
  • Must be created with a constructor new() which includes validation.
  • Supports inheritance
  • Object creation occurs with the setClass function:
setClass("rectangle", representation(length="numeric", width="numeric"))
myRect <- new("rectangle", length=10, width=5)
  • Finding out class information about an object:   class(myRect) or getClass(myRect)
  • Finding out which classes have a particular method:   showMethods(summary)

General information:
  • What tutorials are available for a particular package:  vignette(package="Biostirngs")
  • Read a particular tutorial: vignette("BiostringsQuickOverview")

Biostrings package:

  • XString virtual class with BString, DNAString, RNAString, and AAString subclasses.
  • XStringSet virtual class representing a set of strings, with BStringSetDNAStringSetRNAStringSet, and AAStringSet subclasses.
  • Input with readDNAStringSet (etc) with supported formats: 
  • Equality with == is intelligent (e.g., when an RNA string and a DNA string are compared, it will treat T and U as the same character).
  • XStringViews - class for storing set of start and end locations for an XString.
  • MultipleAlignment object - store multiple sequence alignments.
  • Input of multiple alignments:  readDNAMultipleAlignment, readRNAMultipleAlignment, readAAMultipleAlignment (with formats fasta, stockholm, clustral, or phylip)
Example 1:
x <- readDNAStringSet("TP53HumanGene.fasta")
y <- x[[1]]

Example 2:   I did a multiple alignment of the PTEN protein in humans to the taxonomy class bacteria using blastp.  When the results came back, I selected the multiple alignment and the expanded view of the alignments. Then I downloaded the multiple alignment to bacteria in FASTA format. (You can download this file at: https://googledrive.com/host/0B2dQ2-mnbQvALUsyTWlfbnZZNjQ/PTENMultAlignBacteria.fa).

ptenMA <- readAAMultipleAlignment(filepath="PTENMultAlignBacteria.fa")

Note:  Although the Biostrings documentation says that Clustal and Phylip formats can also be read, I was not able to read a Clustal or Phylip file downloaded from NCBI.

Example 3: Get the Accession numbers out for the items that aligned (add package 'stringsr')
p <- rownames(ptenMA)
pg <- str_split(p, "\\|")
pAccession <- sapply(pg, "[[", 2)

If you need samples to download:

Multiple alignments

  • Multiple alignment: three or more proteins that are completely or partially aligned.
  • Homologous residues align across sequences
  • Assumed to be derived from a common ancestor
  • Must have the collection of sequences that you want to align in hand.
  • Many different approaches to multiple alignment --- with different results
  • May use features such as--- highly conserved features, conserved motifs, or conserved features that result in understood secondary structures such as alpha helices, and beta sheets.
Uses of multiple alignments:
  • Detect homologs (same function) as opposed to just sequence similarity
  • Reveals conserved motifs or residues - these are usually the most biologically significant
  • Can provide insight into evolution and genetic structure
  • Conserved regions usually have 3D structure that is known --- this allows us to infer more information about 3D structure
  • Regions that are susceptible to genetic mutation can also give information about function
  • Multiple alignments are also used for determining the genetic mutations
  • Exact - use multi-dimensional dynamic programming --- time complexity grows exponentially with problem size
  • Progressive (CLUSTW) - start with the two closest alignments and add sequences --- uses pairwise global alignment as the starting point and depends on order added. (* = match, : = close substitution, red = alpha helices
  • Iterative (MUSCLE, PRALINE, INTERALIGN) - use a progressive alignment as a starting point 000 then use dynamic programming. 
  • Consistency (MAFFT, ProbCons, T-Coffee) - like progressive, but decide ordering on the fly.
  • Structure-based (PRALINE)  - use tertiary structure information in alignment.
General comments:  Need at least 25% similarity to work well.  All work well for close alignments.

Alignment of genomic regions:
  • Can also do multiple sequence alignment of genomic regions.
  • Often get regions of close agreement separated by long segments of no agreement.
  • DNA repeats are problem.
  • Chromosomal dislocation can involve millions of base pairs.
  • Usually assume existence of anchors or of an evolutionary tree.
Position-specific iterated BLAST (PSI-blast) -- for finding distantly related proteins:
  • Starts with blastp results.
  • Constructs multiple sequence alignments from resutls
  • Creates a position-specific scoring matrix (PSSM) based on results
  • Estimates significance of results
  • Repeats to improve results
PSSM - is a N x 20 matrix (N is length of query) -- eliminates hits with 94% sequence similarity to get non-redundancy.  Estimates the probability of animo acid substitution and adjusts the scoring matrices --- but must be careful

Manipulating sequences in R (seqinr)

Example 1: Using seqinr -- find out what's in the package

Example 2: Showing the genetic code

Note:  There are many slightly different genetic codes depending on the species. The following gives the vertebrate mitochondrial code:

Example 2:  Reading sequences that have been downloaded in as FASTA format.  (Save tp53 in FASTA and then read.)
tp53HA <- read.fasta("TP53HumanIsoformA.fasta", seqtype="AA")

Example 3:  Finding out information that is available through ACNUC 
gbank <- choosebank(bank="genbank", infobank=T)

Example 4: Get a virtual query to Hominidae and then get a list of species for this query (PS). Finally retrieve all the species below (SD)
query("hominidae", "SP=Hominidae", virtual = T)
query("hsp", "PS hominidae", virtual = T)

query("SDexample", "SD hsp")

Example 5: Queries are case insensitive and individual clauses are enclosed in double quotes

query("dengue", "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"", virtual=T)

Example 6:  Get nuclear DNA.

query("hsCDS", "sp=Homo sapiens AND t=cds AND o=nuclear AND NOT k=partial",
virtual = TRUE))

Query language:
  • Case insensitive
  • Wildcard: @
  • Logical operators: AND, OR, NOT
  • The virtual parameter allows queries to be created without being fetched.

TP53 accession:  NP_000537.3    and PTEN has accession number: XP_006717989.1