Lecture 13: Working with sequence alignments and gene discovery

  • Analysis of sequences
  • Look at conserved domains of your protein
  • Look at structure: https://www.ncbi.nlm.nih.gov/Structure/structure_discover.html
  • Install Cn3D macromolecular viewer: http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml

Getting data from NCBI

Preliminaries: If you are using Firefox, it would be helpful to go to the Options menu and set the downloads to always ask where to save.

Example 1:  TP53 is one of the more complicated proteins
  1. Searched NCBI Protein database for tp53 human (notice Search details box on right)
  2. Selected the link (at the top) for tp53 protein reference sequences
  3. TP53 has 15 identified isoforms --- select isoform A and download FASTA (notice Top organisms at top right)
  4. Also download isoform B in FASTA.
  5. Compare the features listed (Regions and Sites) with the annotations on the protein ideogram bar shown on tumorportal.org.
You can download the FASTA files for this example directly if you wish:

Example 2:  Read in the FASTA sequences for two of the isoforms of the TP53 proteins as string sets.  
tp53HA <- readAAStringSet("TP53HumanIsoformA.fasta")
tp53HB <- readAAStringSet("TP53HumanIsoformB.fasta")
tp53MA <- readAAStringSet("TP53MouseIsoformA.fasta")
tp53MB <- readAAStringSet("TP53MouseIsoformB.fasta")

Example 3: Do a global pairwise alignment using human isoform A as the subject.
tpGlobal <- pairwiseAlignment(c(tp53HB, tp53MA, tp53MB), tp53HA)

Example 4:  Explore the results of this alignment.  Relevant functions include the following. Some are applied to the alignment, while others to the pattern.
  • pattern 
  • subject
  • summary
  • score
  • nmatch
  • nmismatch
  • as.character
  • compareStrings
  • mismatch(pattern ...
  • insertion
  • deletion

Note:  here + means insert gap, ? means mismatch, and - means missing

Example 5:   Use the BLOSUM62 for alignment.
tpGlobalB62 <- pairwiseAlignment(tp53MA, tp53HA, substitutionMatrix = BLOSUM62)

Example 6: Relevant Genbank ID's and accession numbers for TP53:
  • RefSeqGene Human: NG_017013.2  Genbank ID 383209646
  • Reference CRCh38 (Genome Reference Consortium) Human: Genbank ID 568815581 
  • Reference GRCm38:  Genbank ID 372099099
  • Human TP53 Isoform A: NP_000537.3 GI:120407068
  • Mouse TP53 Isoform A: NP_035770.2 GI:148747262


Example 7:  Look at conserved domains
  1. Choose protein accession number for your gene:  NP_000537.3
  2. Search blastp in different groups -- e.g., primates, rodents, insects, bacteria, moss, etc. and understand what parts are conserved.
  3. Are there conservative domains for your protein/gene?
Notes on scores:
  • Raw score: computed from the substitution matrix (you can't compare scores across substitution matrices)
  • Bit score: normalizes the raw scores so that searches using different scoring matrices and different databases can be compared. 
  • E-score:  Expected number of alignments that will have this score or better --- just by chance. 
  • p-value:  1 - exp(-E)  probability of finding an alignment with a score of this level (E) or better. (Values of E < 0.05 are close to p-values).
Example: GenBank has an FTP site. A recent release notes gives you an idea of how this database is evolving.