Laboratory 3: Blasting in R

In this lab, you will be working with your assigned gene and its DNA and protein sequences. You should have downloaded and set up RStudio with the packages listed in Lecture 11. Be sure that you have also installed MiKTeX and can produce a PDF of your lab as described in Lecture 9.

Download the FASTA representations of your gene (DNA) and its protein product (AA) for human as separate files.  Find the corresponding gene and protein (by name) for mouse and download those as well.  If the protein corresponding to your gene has multiple isoforms, download the first or lowest-numbered one (isoforms are often designated a, b, c, etc.). Also download the .MAF file for your gene from (as discussed in class).


Part 1:  Display the distributions of the different types of mutations as a function of cancer type using a ggplot2 faceted display.
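
As a starting point for the faceted display, here is a minimal sketch on a toy data frame. The column names `cancer_type` and `variant_class` and all of the values are hypothetical placeholders, not taken from any real .MAF file; substitute the columns your file actually uses. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy data standing in for rows of a .MAF file
# (hypothetical column names and values).
maf_toy <- data.frame(
  cancer_type   = c("A", "A", "A", "B", "B", "C"),
  variant_class = c("Missense", "Missense", "Nonsense",
                    "Missense", "Silent", "Nonsense")
)

# One bar per mutation type, one panel per cancer type.
p <- ggplot(maf_toy, aes(x = variant_class)) +
  geom_bar() +
  facet_wrap(~ cancer_type) +
  labs(x = "Mutation type", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

With many cancer types, `facet_wrap()` arranges the panels in a grid automatically, which tends to be easier to read than a single stacked bar chart.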

Part 2:  Display the distribution of specific SNP mutations as a function of cancer type using a ggplot2 faceted display.
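
One possible way to prepare the data for this display is sketched below in base R. It keeps only the SNP rows and labels each one by its base change (for example "C>T"). The column names `variant_type`, `ref_allele`, and `alt_allele` are assumptions for illustration, and the rows are made up; check the header of your own .MAF file for the real names.

```r
# Toy rows with hypothetical column names; real .MAF files may differ.
maf_toy <- data.frame(
  cancer_type  = c("A", "A", "B", "B", "B"),
  variant_type = c("SNP", "DEL", "SNP", "SNP", "INS"),
  ref_allele   = c("C", "AT", "G", "C", "-"),
  alt_allele   = c("T", "-", "A", "T", "G"),
  stringsAsFactors = FALSE
)

# Keep only single-nucleotide changes and label each one, e.g. "C>T".
snps <- maf_toy[maf_toy$variant_type == "SNP", ]
snps$change <- paste0(snps$ref_allele, ">", snps$alt_allele)

# Count each change within each cancer type, in long form.
snp_counts <- as.data.frame(
  table(cancer_type = snps$cancer_type, change = snps$change),
  stringsAsFactors = FALSE
)
print(snp_counts)
```

The resulting `snp_counts` data frame has one row per cancer type and base change, which is the shape ggplot2 expects for a bar chart faceted by cancer type.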

Part 3:  Compare the alignment scores of global and local alignment using the default and the following substitution matrices:

for your gene in human and mouse using the AA sequences.  Store the results in an appropriate data.frame and use ggplot2 to display appropriately. 
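
To make the difference between the two alignment types concrete, the sketch below computes global (Needleman-Wunsch) and local (Smith-Waterman) scores in base R and stores them in a data.frame. It is an illustration only: the function `align_score` is hypothetical, it uses a simple match/mismatch score with a linear gap penalty rather than a substitution matrix, and the sequences are toy strings. For the lab itself you would use the alignment functions from the packages listed in Lecture 11.

```r
# Score of the best alignment of two strings under a simple scoring scheme.
# local = FALSE: Needleman-Wunsch (global); local = TRUE: Smith-Waterman (local).
align_score <- function(a, b, match = 1, mismatch = -1, gap = -2,
                        local = FALSE) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  H <- matrix(0, n + 1, m + 1)
  if (!local) {
    # A global alignment pays for leading gaps.
    H[, 1] <- gap * (0:n)
    H[1, ] <- gap * (0:m)
  }
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      best <- max(H[i, j] + s, H[i, j + 1] + gap, H[i + 1, j] + gap)
      # A local alignment may restart anywhere, so scores never drop below 0.
      H[i + 1, j + 1] <- if (local) max(0, best) else best
    }
  }
  if (local) max(H) else H[n + 1, m + 1]
}

# Toy sequences (not real gene data).
scores <- data.frame(
  type  = c("global", "local"),
  score = c(align_score("AAACGT", "CGT"),
            align_score("AAACGT", "CGT", local = TRUE))
)
print(scores)
```

Under the same scoring scheme, the local score is never lower than the global score, because the full end-to-end alignment is one of the candidates the local search considers. Adding a column for the substitution matrix to a data frame like `scores` gives a layout that ggplot2 can plot directly.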

Part 4:  Perform the BLOSUM62 protein alignment using blastp from NCBI with global alignment. Download the alignment results and compare them with your computed results. 

Part 5: Repeat Part 3 with the DNA versions of your gene.

Part 6:  Compare the number of insertions and deletions for the DNA versions of your gene using global alignments with the defaults and with BLOSUM62. 
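
One simple way to count insertions and deletions is to count the runs of gap characters in each row of the alignment. The base R sketch below assumes you have already extracted the two aligned rows as gapped strings; the helper `count_gap_runs` and the example sequences are hypothetical. Which row's gaps count as "insertions" depends on which sequence you treat as the reference, so state your convention in the report.

```r
# Count gap runs (stretches of "-") in one row of a gapped alignment.
count_gap_runs <- function(aligned) {
  hits <- gregexpr("-+", aligned)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Toy gapped alignment (hypothetical sequences, not real gene data).
aligned_pattern <- "ACG--TACGT"
aligned_subject <- "ACGTTTA-GT"

indels <- data.frame(
  # Gaps in the pattern row: the subject has extra bases
  # (insertions relative to the pattern).
  insertions = count_gap_runs(aligned_pattern),
  # Gaps in the subject row: the subject is missing bases
  # (deletions relative to the pattern).
  deletions  = count_gap_runs(aligned_subject)
)
print(indels)
```

This counts each contiguous gap as one event regardless of its length; counting individual gap characters instead would give the total number of inserted or deleted bases.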

Part 7: Write a report that has 6 paragraphs discussing the results of each of the 6 parts of this lab.

Zip up your project directory and upload to Blackboard. Also save a knit file. In your project directory, you should create both an HTML file and a PDF file. Your project directory should include all of the data needed to run the scripts. 

Some notes:
  1. If you try to place the SNPs on the chromosome using the tumorportal data, you may run into difficulties because the positions don't correspond. This is because the Nature paper uses assembly GRCh37 (hg19), but the current assembly listed on NCBI is GRCh38, and coordinates differ between the two. See for additional information about remapping.
  2. If your protein has multiple isoforms or sequences, try to use the GRC reference sequence and the lowest-numbered isoform, e.g. isoform a.