Lecture 21: Working with expression sets

Goal of today's lecture: Understand how to calculate differential microarray expression and what to do with it.

Strategy:
  • Measure gene expression under two (or more) conditions, for example cancer versus normal tissue
  • Find out which genes behave significantly differently under the two conditions
  • Look for relationships among these genes
  • Hopefully this will give a clue as to mechanism
Caveats:
  • You need to take care in selecting the conditions to contrast so that they make sense.
  • Samples taken from different tissues may have completely different behavior (cells differentiate or specialize).
  • You should always use a correction for multiple comparisons (usually false discovery rate).
  • Some genes are highly expressed no matter what (sometimes called housekeeping genes). It may be hard to detect differential expression for these genes, but they may play an important role anyway.
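The strategy and the multiple-comparison caveat can be sketched in base R on simulated data. This is only an illustration, not the GEO2R method: it uses an ordinary per-gene t-test rather than the linear-model approach, and the matrix expr, the group labels, the planted effect, and the 0.05 cutoff are all made-up assumptions.

```r
# Toy data (hypothetical): 100 genes x 6 samples, 3 normal and 3 cancer.
set.seed(1)
expr <- matrix(rnorm(100 * 6, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c("N1", "N2", "N3", "C1", "C2", "C3")))
group <- factor(c("normal", "normal", "normal", "cancer", "cancer", "cancer"))

# Plant a difference: shift the first 10 genes upward in the cancer samples.
expr[1:10, group == "cancer"] <- expr[1:10, group == "cancer"] + 3

# One ordinary two-sample t-test per gene (row of the matrix).
pvals <- apply(expr, 1, function(x) {
  t.test(x[group == "cancer"], x[group == "normal"])$p.value
})

# Benjamini-Hochberg false discovery rate correction for 100 comparisons.
padj <- p.adjust(pvals, method = "BH")

# Genes whose adjusted p-value falls below the (assumed) 0.05 cutoff.
sig <- names(padj)[padj < 0.05]
print(sig)
```

Comparing sum(pvals < 0.05) with length(sig) shows how many apparent hits the correction removes.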
Side comment:
For background on cell differentiation and stem cells, see http://en.wikipedia.org/wiki/Stem_cell

Reminder of Lab 5 procedures:
  • Pick your series and enter it in the series section of ncbi.nih.gov/geo
  • Go to GEO2R. (If your series does not have a GEO2R option, you need to pick another series.)
  • Pick your groups (use CTRL-click and the mouse wheel).
  • Calculate the top 250.
  • Download the R Script.
  • Experiment with the conditions: you must have some genes (preferably at least 250) that are differentially expressed.

GEO2R uses a statistical model based on linear models and empirical Bayes methods. The details are beyond the scope of this course, but we can apply the method and interpret the results. The method is described in the article:
Linear models and empirical Bayes methods for assessing differential expression in microarray experiments
by Gordon Smyth
http://www.statsci.org/smyth/pubs/ebayes.pdf



Some R preliminaries:
  • The save and load functions allow you to save all or part of your workspace to a file. As we get to bigger files this is useful.
  • Remember that the @ operator is used to get slots (pieces) out of an object:  gset@featureData
  • ExpressionSet is the R class used for microarray analysis. It is derived from the eSet class, which is a general-purpose abstract container class for high-throughput data.
    • Slots or properties:
      • assayData - holds the measurements as a features (rows) by samples (columns) matrix
      • phenoData - is an AnnotatedDataFrame containing information about the meaning of the microarray samples (columns)
      • featureData - is an AnnotatedDataFrame containing information about the meaning of the probes on each microarray (rows)
    • Functions or methods:
      • exprs - extracts the matrix of expression values (features by samples) from assayData
  • AnnotatedDataFrame is the R class that contains metadata about the meaning of samples of a data collection
    • Slots or properties:
      • data is a data frame containing the metadata values, with one row per sample (or feature) and one column per variable
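The preliminaries on save, load, and @ can be tried without downloading anything. The sketch below uses a hypothetical toy class, ToySet, as a stand-in for an ExpressionSet; its slot names are borrowed from the list above, but it is not the Biobase class and has none of its methods.

```r
library(methods)

# Hypothetical toy S4 class with two slots, standing in for an ExpressionSet.
setClass("ToySet", representation(assayData = "matrix", phenoData = "data.frame"))

toy <- new("ToySet",
           assayData = matrix(1:6, nrow = 3,
                              dimnames = list(c("p1", "p2", "p3"), c("s1", "s2"))),
           phenoData = data.frame(subtype = c("A", "B"),
                                  row.names = c("s1", "s2"),
                                  stringsAsFactors = FALSE))

# slotNames lists the slots of an object; @ extracts one of them.
print(slotNames(toy))
print(dim(toy@assayData))

# save writes the named objects to a file; load restores them under the same names.
f <- tempfile(fileext = ".RData")
save(toy, file = f)
rm(toy)
load(f)
print(toy@phenoData$subtype)
```

Calling save with several object names stores part of the workspace; save.image stores all of it.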


Example 1: Download a series from NCBI GEO into an expression set
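library(GEOquery)  # Bioconductor package that provides getGEO
gset <- getGEO("GSE35896", GSEMatrix = TRUE)
gset <- gset[[1]]  # keep the first (and only) list element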
Note: The GSE35896 series contains a single platform (GPL570). The getGEO function returns a list, so the second statement makes gset an ExpressionSet rather than a one-element list. Also take 

Example 2: Fix the feature variable labels (fvarLabels, the column names of the row metadata in featureData) so that they are valid names
fvarLabels(gset) <- make.names(fvarLabels(gset))
Note: The make.names function converts strings into valid variable and column names.
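A quick way to see what make.names does is to run it on a few strings. The labels below are made-up examples, not the actual fvarLabels of any series.

```r
# Hypothetical labels containing a space, a colon, and a leading digit.
labels <- c("ID", "Gene Symbol", "GO:Function", "1st")

# Invalid characters become "." and names that cannot start a variable get an "X" prefix.
fixed <- make.names(labels)
print(fixed)
```

After this conversion the labels can be used with the $ operator on a data frame without quoting.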

Example 3: Download series GSE6357, which contains both platforms GPL96 and GPL570.  (You can download this from
gsetTest <- getGEO("GSE6357", GSEMatrix = TRUE)

Note: gsetTest is a two-element list with data in GSE6357-GPL96_series_matrix.txt.gz and GSE6357-GPL570_series_matrix.txt.gz.

Example 4: Separate GSE6357 into two items, each an ExpressionSet.
g570 <- gsetTest[[grep("GPL570", names(gsetTest))]]
g96 <- gsetTest[[grep("GPL96", names(gsetTest))]]
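The name-matching step in Example 4 can be checked offline with a stand-in list. Here gsetToy is a hypothetical placeholder whose elements are plain strings rather than ExpressionSets; only the element names mirror the file names mentioned in the note for Example 3.

```r
# Hypothetical stand-in for the two-element list returned by getGEO.
gsetToy <- list("placeholder for the GPL570 data",
                "placeholder for the GPL96 data")
names(gsetToy) <- c("GSE6357-GPL570_series_matrix.txt.gz",
                    "GSE6357-GPL96_series_matrix.txt.gz")

# grep returns the position of each name that contains the platform ID.
idx570 <- grep("GPL570", names(gsetToy))
idx96 <- grep("GPL96", names(gsetToy))

# Double brackets pull out the element itself rather than a sub-list.
x570 <- gsetToy[[idx570]]
x96 <- gsetToy[[idx96]]
print(c(idx570, idx96))
```

Matching on the name rather than a fixed position means the code does not depend on the order in which the platforms are returned.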

Example 5: Extract the metadata identifying samples in the GSE35896 series that was previously downloaded.
samps <- gset@phenoData
sampInfo <- samps@data

Note: The sampInfo variable now contains a data frame describing the characteristics of each sample (each column of the expression matrix).

Example 6: Look at the row and column names of the sample metadata
attributes(sampInfo)

Example 7: Define a column vector with the sample subtypes in it
sampSub <- sampInfo$characteristics_ch1
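Examples 6 and 7 can be rehearsed on a small made-up data frame. The sample IDs, titles, and subtype strings below are hypothetical and are not taken from GSE35896; only the column name characteristics_ch1 comes from the example above.

```r
# Hypothetical sample metadata: one row per sample.
sampInfo <- data.frame(
  title = c("tumor 1", "tumor 2", "tumor 3", "tumor 4"),
  characteristics_ch1 = c("subtype: A", "subtype: B", "subtype: A", "subtype: A"),
  row.names = c("GSM0001", "GSM0002", "GSM0003", "GSM0004"),
  stringsAsFactors = FALSE
)

# attributes shows the column names, the class, and the row names.
print(names(attributes(sampInfo)))

# Pull out the subtype column and count the samples in each subtype.
sampSub <- sampInfo$characteristics_ch1
subCounts <- table(sampSub)
print(subCounts)

# Row names of the samples that belong to one subtype.
subA <- rownames(sampInfo)[sampSub == "subtype: A"]
print(subA)
```

A vector like sampSub is what defines the groups to contrast when testing for differential expression.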

Example 8: Download the NCBI platform annotation and create a data.frame with the information
# load NCBI platform annotation
gpl <- annotation(gset)
platf <- getGEO(gpl, AnnotGPL=TRUE)
ncbifd <- data.frame(attr(dataTable(platf), "table"))
