Lecture 16: What information is available

  • Some ggplot2 examples to help with Lab 3
  • Review what we did so far
  • Look at some of the other resources that are available and what they tell us

A couple more ggplot2 examples (see http://docs.ggplot2.org/current/):

Example 1: Bar chart with rotated labels

library(ggplot2)  # provides ggplot() and the diamonds data set

ggplot(diamonds, aes(factor(cut))) +
  geom_bar() +
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90, vjust=1),  # rotate the x-axis labels
        plot.title=element_text(face="bold", size=14)) +
  xlab("Cut quality") +
  ggtitle("Diamond properties")

Example 2: Bar chart with stacked color

# fill=factor(color) splits each bar into one stacked segment per diamond color
ggplot(diamonds, aes(factor(cut), fill=factor(color))) +
  geom_bar() +
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90, vjust=1),
        plot.title=element_text(face="bold", size=14)) +
  scale_fill_discrete(name="Color") +
  xlab("Cut quality") +
  ggtitle("Diamond properties")

Example 3: Bar chart faceted by clarity

# facet_grid(~clarity) draws one panel per clarity level
ggplot(diamonds, aes(cut)) +
  geom_bar() +
  facet_grid(~clarity) +
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90, vjust=1),
        plot.title=element_text(face="bold", size=14)) +
  xlab("Cut quality") +
  ggtitle("Diamond properties")

Some background:
Gene locations are often given in cytogenetic (p/q) notation, for example 17q21.31. The leading number (or X/Y) is the chromosome, p or q designates the short or long arm, and the digits that follow indicate the band and sub-band on that arm. See: http://ghr.nlm.nih.gov/handbook/howgeneswork/genelocation
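
As an illustration of the notation, here is a minimal sketch that splits a p/q location into its parts. The function name, the regular expression, and the field names are assumptions made for this example, and only simple locations (no ranges such as 17q21-q22) are handled.

```r
# Parse a simple cytogenetic location such as "17q21.31".
# parse_cytoband() is a hypothetical helper, not part of any package.
parse_cytoband <- function(loc) {
  # chromosome (1-22, X or Y), arm (p or q), band digits, optional sub-band
  pattern <- "^([0-9]{1,2}|X|Y)([pq])([0-9]+)(\\.([0-9]+))?$"
  m <- regmatches(loc, regexec(pattern, loc))[[1]]
  if (length(m) == 0) stop("not a simple p/q location: ", loc)
  list(
    chromosome = m[2],
    arm        = if (m[3] == "p") "short (p)" else "long (q)",
    band       = m[4],
    subband    = if (nzchar(m[6])) m[6] else NA_character_
  )
}

str(parse_cytoband("17q21.31"))
```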

Accession numbers: There is a convention for accession numbers so that you can tell where the sequence was originally deposited (see http://www.ncbi.nlm.nih.gov/Sequin/acc.html).
Notes:
  • MGA = Marine group A, a candidate phylum of bacteria in the ocean (http://www.ncbi.nlm.nih.gov/pubmed/23151638).
  • WGS = whole genome shotgun assemblies, which are incomplete (https://www.ncbi.nlm.nih.gov/genbank/wgs).
  • Accession numbers containing an underscore generally correspond to reference sequences.
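
The underscore rule of thumb can be sketched in a few lines. The pattern below (two capital letters, an underscore, digits, and an optional version suffix) is an assumption chosen for illustration, and the accession strings are only sample inputs; the authoritative prefix tables are on the NCBI page linked above.

```r
# Flag accessions that look like reference sequences (underscore after a two-letter prefix).
# is_refseq_like() is a hypothetical helper for this example.
is_refseq_like <- function(acc) {
  grepl("^[A-Z]{2}_[0-9]+(\\.[0-9]+)?$", acc)
}

accs <- c("NM_000546.6", "NP_000537", "U49845", "AF123456.1")
data.frame(accession = accs, refseq_like = is_refseq_like(accs))
```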

Summary and review: 

Sequence information allows searching for similar sequences across species, in order to find conserved regions and homologous genes.
  1. DNA sequences (NCBI)
  2. AA sequences (NCBI)
We have focused on GenBank from NCBI. GenBank has been growing exponentially, doubling roughly every 18 months, a pace similar to Moore's law.
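
To get a feel for what an 18-month doubling time implies, the arithmetic can be sketched as follows. The function name is made up for this example, and the 18-month figure is simply the one quoted above, not a fitted value.

```r
# Growth factor after a given number of years, assuming a fixed doubling time.
growth_factor <- function(years, doubling_months = 18) {
  2^(years * 12 / doubling_months)
}

growth_factor(1.5)  # 2: one doubling
growth_factor(3)    # 4: two doublings
growth_factor(10)   # about 100-fold growth in a decade
```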

There is an International Nucleotide Sequence Database Collaboration (INSDC), through which three databases exchange information:
National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/
DNA Data Bank of Japan: http://www.ddbj.nig.ac.jp/
European Molecular Biology Laboratory: http://www.embl.org/

We did:
  • pairwise alignment (align two given sequences), which gives an unscaled similarity score
  • sequence similarity (BLAST) searches, which do pairwise alignment across a database and report E-values (the number of matches this good that would be expected by chance in a database of that size)
  • multiple sequence alignment (align three or more sequences at the same time), used to look for conserved regions and homology
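
As a reminder of what the unscaled score and the E-value mean, here is a minimal sketch. nw_score() computes a global alignment score with the standard Needleman-Wunsch recurrence; the match, mismatch, and gap values are arbitrary assumptions. e_value() uses the usual Karlin-Altschul form E = K * m * n * exp(-lambda * S); the K and lambda values passed in below are placeholders, not parameters taken from BLAST. Note that BLAST itself uses a local alignment heuristic rather than this global alignment.

```r
# Global pairwise alignment score (Needleman-Wunsch dynamic programming).
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  S <- matrix(0, n + 1, m + 1)
  S[, 1] <- gap * (0:n)  # aligning a prefix of a against nothing
  S[1, ] <- gap * (0:m)  # aligning a prefix of b against nothing
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      diag_score <- S[i, j] + if (x[i] == y[j]) match else mismatch
      S[i + 1, j + 1] <- max(diag_score, S[i, j + 1] + gap, S[i + 1, j] + gap)
    }
  }
  S[n + 1, m + 1]
}

# Expected number of chance hits with score >= S for query length m and
# database length n. K and lambda are illustrative placeholders.
e_value <- function(S, m, n, K, lambda) {
  K * m * n * exp(-lambda * S)
}

nw_score("GATTACA", "GATTACA")  # 7: seven matches
nw_score("GATTACA", "GATTCA")   # 4: six matches and one gap
e_value(S = c(20, 40), m = 100, n = 1e6, K = 0.1, lambda = 0.3)
```

The last call shows why the E-value is more useful than the raw score: it falls exponentially as the score rises and grows with the size of the database searched.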

The Nature article on cancer genes looked at sequence information from approximately 4700 tumors and compared the tumors with normal tissue to find differences (mutations). The tumor sequences were available from public databases (mostly the Cancer Genome Atlas). Virtually all genes have some mutations in cancer cells, so the authors performed a sophisticated statistical analysis to show which genes were mutated "significantly" more often than expected.
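
To make the idea of a "significant" mutation count concrete, here is a deliberately simplified sketch. It is not the authors' method: it assumes a single uniform background mutation rate per base per tumor and uses a one-sided binomial test, followed by a Benjamini-Hochberg correction across genes. The gene names, lengths, counts, and background rate are all invented for illustration.

```r
# One-sided binomial p-value: probability of seeing at least `observed`
# mutations in a gene if mutations occur only at the background rate.
mutation_pvalue <- function(observed, gene_length, n_tumors, background_rate) {
  pbinom(observed - 1, size = gene_length * n_tumors,
         prob = background_rate, lower.tail = FALSE)
}

# Hypothetical genes of equal length with different mutation counts.
genes <- data.frame(
  gene     = c("geneA", "geneB"),
  length   = c(1500, 1500),
  observed = c(8, 40)
)
genes$p_value <- mutation_pvalue(genes$observed, genes$length,
                                 n_tumors = 4700, background_rate = 1e-6)
genes$q_value <- p.adjust(genes$p_value, method = "BH")  # multiple-testing correction
genes
```

With these made-up numbers about 7 mutations per gene are expected by chance, so 8 observed mutations is unremarkable while 40 is far beyond what the background rate explains.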

Big biological question: since the normal cells of a particular individual all have the same genome, how can we explain the differences in behavior between different cell types (e.g., neurons in the brain versus skin cells)?

Bottom line: The sequences are just the beginning of understanding mechanism. We must also understand:
  • How much is the gene expressed (transcribed into RNA and, for protein-coding genes, translated into protein)?
  • What are the transcription factors that cause the gene to be expressed?
  • What are the properties of the protein(s) produced by this gene?
  • What proteins interact with the protein(s) produced by this gene?
  • What pathways are the proteins in?
  • What biological function does the protein have?
  • What diseases is this gene implicated in?
  • What drugs is this protein known to interact with?

What other information is available and important:

OMIM, Online Mendelian Inheritance in Man (database hosted by NCBI but maintained by Johns Hopkins): http://www.ncbi.nlm.nih.gov/omim
Genetic Testing Registry (NCBI - GTR): http://www.ncbi.nlm.nih.gov/gtr/
Database of Genotypes and Phenotypes (NCBI - dbGaP): http://www.ncbi.nlm.nih.gov/gap
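
Besides the web pages, NCBI databases can be queried programmatically through the Entrez E-utilities. The sketch below only builds a search URL and does not contact the server. The helper name is made up for this example, and the database name "omim" and the search term are assumptions to be checked against the E-utilities documentation before use.

```r
# Build an E-utilities esearch URL for a given Entrez database and search term.
# esearch_url() is a hypothetical helper; it only constructs the URL string.
esearch_url <- function(db, term) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
         "?db=", db,
         "&term=", utils::URLencode(term, reserved = TRUE),
         "&retmode=json")
}

esearch_url("omim", "cystic fibrosis")
```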

Gene expression using microarrays:
GEO, Gene Expression Omnibus profiles and datasets (NCBI): http://www.ncbi.nlm.nih.gov/geo/
SAGE (serial analysis of gene expression): http://en.wikipedia.org/wiki/Serial_Analysis_of_Gene_Expression

RNA-seq (http://en.wikipedia.org/wiki/RNA-Seq):
ChIPBase: http://deepbase.sysu.edu.cn/chipbase/expression.php
EST, expressed sequence tags (NCBI): http://www.ncbi.nlm.nih.gov/nucest
UniGene clusters of expressed transcripts (NCBI): http://www.ncbi.nlm.nih.gov/unigene
TCGA (The Cancer Genome Atlas): http://cancergenome.nih.gov/
ENCODE (functional elements including regulatory elements): https://genome.ucsc.edu/ENCODE/

Pathways:
Reactome: http://www.reactome.org
KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg/
SMPDB (Small Molecule Pathway Database): http://www.smpdb.ca/

Protein-protein interactions:
Database of Interacting Proteins (DIP)
Biomolecular Interaction Network Database (BIND)
Human Protein Reference Database (HPRD)
Biological General Repository for Interaction Datasets (BioGRID)

Protein structure and domains:

Drug interactions and small molecules: