Lecture 18: Microarray analysis in GEO and in R

  • Lab 3 (brief discussion of the solution)
  • Microarray analysis
  • NCBI GEO to get microarray data
  • Microarray analysis in R
In the news:
New view of tumors' evolution: Sequencing of cancer cell genomes reveals potential new drug targets for an aggressive type of lung cancer

Microarray analysis principles:
  • After a complicated process of transformation and normalization (which everyone does differently), the experimental data is reduced to a table of probes x samples holding expression values (which may or may not be expressed as logs).
  • The expression levels only have meaning within a single series whose arrays have all been normalized in the same way.
  • Three standard approaches:
    • Differential expression: Pick a pair of experimental conditions represented by the experiment and see which genes have significantly different expression levels. This allows you to discover which genes might be relevant or contribute to behavior in a particular experimental condition. Also look at whether any of these genes share common characteristics to infer mechanisms.
    • Gene profile clusters: Treat the expression values for each gene (over samples) as a vector and use clustering to find the vectors that are most correlated. Genes grouped in the same cluster may share a pathway (because they are up-regulated or down-regulated together). Also look for what the genes in each group have in common.
    • Models of gene expression: Fit the expression of a gene or group of genes to a linear model based on experimental factors. Alternatively, fit the sample vectors to a linear model to find a model in terms of gene factors.
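The first two approaches can be sketched in base R on simulated data. This is a minimal illustration, not the lecture's own analysis: the matrix expr, the group labels, and the spiked-in shift of 2 for the first 10 probes are all made up for the example, and the per-probe Welch t-test is just one simple choice of test.

```r
# Toy differential-expression and clustering example on simulated data (not real GEO data).
set.seed(1)
n_genes <- 200
# Rows are probes, columns are samples: 5 control arrays followed by 5 treated arrays.
expr <- matrix(rnorm(n_genes * 10, mean = 8, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("probe", 1:n_genes),
                               c(paste0("ctrl", 1:5), paste0("trt", 1:5))))
# Spike in a true difference for the first 10 probes (values treated as log2 scale).
expr[1:10, 6:10] <- expr[1:10, 6:10] + 2
group <- factor(rep(c("ctrl", "trt"), each = 5))

# Differential expression: Welch t-test per probe between the two conditions.
pvals <- apply(expr, 1, function(x) {
  t.test(x[group == "trt"], x[group == "ctrl"])$p.value
})
# Log fold change: difference of group means on the log scale.
log_fc <- rowMeans(expr[, group == "trt"]) - rowMeans(expr[, group == "ctrl"])

top <- head(order(pvals), 10)
print(data.frame(probe = rownames(expr)[top], log_fc = log_fc[top], p = pvals[top]))

# Gene profile clusters: correlation-based distance between probe vectors.
d <- as.dist(1 - cor(t(expr[1:30, ])))
clusters <- cutree(hclust(d, method = "average"), k = 2)
print(table(clusters))
```

With thousands of probes, the raw p-values from a loop like this need a multiple-comparison correction before any probe is called significant.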
Benchmarked microarray results across platforms and labs and analysis methods.

Affymetrix is one of the main platforms. More recently, Illumina and, to some extent, Agilent platforms have become very popular.

Note: We will focus on Affymetrix platforms because they have the most deposited and the most consistent data so far. Generally we will stick to platforms GPL96 (human), GPL570 (human), and GPL1261 (mouse).

Microarray analysis: 
There's so much stuff -- how do we access it all?  
  • Data that is deposited in GEO has little quality control.
  • Huge variation in results based on processing. 
  • Must take care to normalize all of the arrays that you are combining in the same way.
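To make "normalize in the same way" concrete, here is a minimal base R sketch of quantile normalization, which forces every array to share one distribution of values. It is shown only as an illustration of one common technique (RMA in the affy package includes a quantile normalization step); the function name quantile_normalize and the simulated arrays a1, a2, a3 are assumptions made for this example.

```r
# Quantile normalization sketch: force every array (column) to share one distribution.
quantile_normalize <- function(x) {
  # Rank of each value within its own array; ties are broken by order of appearance.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across arrays.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value that has the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(2)
# Three simulated arrays with deliberately different means and spreads.
raw <- cbind(a1 = rnorm(100, 8, 1), a2 = rnorm(100, 9, 2), a3 = rnorm(100, 7, 0.5))
normed <- quantile_normalize(raw)

# After normalization the per-array quantiles agree.
print(apply(raw, 2, quantile))
print(apply(normed, 2, quantile))
```

The point of the sketch is that the reference distribution depends on which arrays are in the set, so arrays normalized in separate batches are not directly comparable.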
Some resources for R packages:
affy: http://www.bioconductor.org/packages/release/bioc/vignettes/affy/inst/doc/affy.pdf

SQL tutorials:

Preliminaries (if you haven't already done so)
library(GEOmetadb)
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()  # one-time download of the metadata database

Example 1: Connect to the local GEOmetadb metadata database
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')

Example 2: Get a list of the tables
dbListTables(con)

Example 3: Get a list of the column names in the GDS table and the GDS_SUBSET tables
dbListFields(con, 'gds')
dbListFields(con, 'gds_subset')
dbGetQuery(con, 'PRAGMA TABLE_INFO(gds_subset)')

Example 4: Get some of the subset rows
rs1 <- dbGetQuery(con,'select * from gds_subset limit 5')

Example 5: Get the column names and the first few rows of the GSE table
dbListFields(con, 'gse')
rs2 <- dbGetQuery(con,'select * from gse limit 5')

Example 6: Get the series from platform GPL570
sql <- paste("SELECT DISTINCT gse.title,gse.gse",
             "FROM gse JOIN gse_gpl ON gse.gse = gse_gpl.gse",
             "WHERE gse_gpl.gpl LIKE '%GPL570%'", sep=" ")
rs <- dbGetQuery(con,sql)
In this example, the platforms aren't in the gse table. Rather we have to look them up in the gse_gpl table.
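The lookup through gse_gpl can be tried on a toy in-memory database before running it against the full GEOmetadb file. This sketch assumes the DBI and RSQLite packages are installed; the accession names (GSE_A, GSE_B, GSE_C) and titles are invented, and only the table and column names mirror the query above.

```r
library(DBI)
library(RSQLite)

# In-memory toy database; the rows below are invented for illustration.
toy <- dbConnect(SQLite(), ":memory:")
dbWriteTable(toy, "gse", data.frame(
  gse   = c("GSE_A", "GSE_B", "GSE_C"),
  title = c("Toy lung tumor series", "Toy liver series", "Toy two-platform series"),
  stringsAsFactors = FALSE))
# One row per (series, platform) pair, so a series may appear more than once.
dbWriteTable(toy, "gse_gpl", data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C", "GSE_C"),
  gpl = c("GPL570", "GPL96", "GPL570", "GPL96"),
  stringsAsFactors = FALSE))

# Join on the shared gse column, then filter on the platform.
sql <- paste("SELECT DISTINCT gse.title, gse.gse",
             "FROM gse JOIN gse_gpl ON gse.gse = gse_gpl.gse",
             "WHERE gse_gpl.gpl = 'GPL570'")
toy_rs <- dbGetQuery(toy, sql)
print(toy_rs)
dbDisconnect(toy)
```

Keeping platforms in a separate table lets a single series be linked to several platforms, which a single platform column in gse could not represent cleanly.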

Example 7: Pick out the series from platform GPL570 that have one of the cancer keywords (cancer, carcinoma, malignant, neoplasm, tumor) in the title of the series.
sql <- paste("SELECT DISTINCT gse.title,gse.gse",
             "FROM gse JOIN gse_gpl ON gse.gse = gse_gpl.gse",
             "WHERE gse_gpl.gpl LIKE '%GPL570%'", 
             "AND (gse.title LIKE '%cancer%' OR",
                  "gse.title LIKE '%carcinoma%' OR",
                  "gse.title LIKE '%malignant%' OR",
                  "gse.title LIKE '%neoplasm%' OR",
                  "gse.title LIKE '%tumor%')",  sep=" ")
rs <- dbGetQuery(con,sql)

In this example we picked out the GPL570 series that had "cancer-type" keywords in the title.

Example 8: Disconnect from the database
dbDisconnect(con)

Multiple comparison problem with tests of significance: If you run t-tests for 10,000 comparisons at a significance threshold of 0.05, about 500 (10,000 x 0.05) are expected to come out significant just by chance even when there are no real differences. This is called the multiple comparison problem. There are two standard approaches:
  1. Bonferroni correction: lower the p-value threshold for significance to 0.05/10,000 = 0.000005 if there were 10,000 comparisons. (This is very conservative.)
  2. Control the false discovery rate (FDR) instead of the per-test error rate.
FDR = (# of false positives) / (# called significant)
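Both corrections are available in base R through p.adjust. The sketch below uses simulated p-values, so the split into 9,900 null tests and 100 true effects is an assumption of the example, and Benjamini-Hochberg ("BH") is shown as one common FDR procedure rather than the one the lecture necessarily had in mind.

```r
set.seed(3)
m <- 10000
# 9,900 null tests (uniform p-values) plus 100 true effects with small p-values.
p <- c(runif(9900), rbeta(100, 1, 5000))
is_null <- rep(c(TRUE, FALSE), c(9900, 100))

raw_hits  <- p < 0.05                                    # no correction
bonf_hits <- p.adjust(p, method = "bonferroni") < 0.05   # Bonferroni: p * m, capped at 1
bh_hits   <- p.adjust(p, method = "BH") < 0.05           # Benjamini-Hochberg FDR

# Number called significant, and the observed share of false positives among them.
summarize <- function(hits) {
  c(called = sum(hits),
    false_pos = sum(hits & is_null),
    observed_fdr = sum(hits & is_null) / max(sum(hits), 1))
}
print(rbind(raw = summarize(raw_hits),
            bonferroni = summarize(bonf_hits),
            BH = summarize(bh_hits)))
```

Because the simulation knows which tests are null, the observed_fdr column can be computed directly; with real data that share is unknown, which is why FDR procedures control it in expectation.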