Lecture 22: Linear models and Lab 4 recap

Labororatory 4: Discovery with microarrays

Goals for today:
  • Understand linear models and errors.
  • Become more familiar with microarray data and the ideas of differential expression

There are many ways to accomplish a particular activity. For example, getting a count of the unique series and their samples. The following are three solutions (given by class members) are better than my solution.

Solution 1: Use aggregate
# Aggregate series from previous query results count function to use for
# aggregation
count <- function(x) sum(!is.na(x))
rsFBXW7GSE <- aggregate(rsFBXW7GSM$gsm, by = list(rsFBXW7GSM$gse), FUN = count)
names(rsFBXW7GSE) <- c("GSE", "Number of Samples")

Solution 2: Use merge

unique <- samples[!duplicated(samples[c("gse")]), ]
unique$gsm <- NULL
unique$title <- NULL

The number of samples in each series

freq <- as.data.frame(table(samples$gse))
colnames(freq) <- c("gse", "samples")
unique <- merge(unique, freq, by = "gse")
Solution 3: Use the count from the plyr package
## Warning: package 'plyr' was built under R version 3.0.3
p3Series <- count(p3Samps, vars = c("gse"))
p3Series <- p3Series[with(p3Series, order(p3Series$freq, decreasing = TRUE)), 
# getting rid of row.names for asthetic purposes
row.names(p3Series) <- seq(nrow(p3Series))
save(p3Series, file = "part3.RObj")
Solution 4: Do the calculation manually with a hash table 
This approach is not as good as the other approaches, but it illustrates how to use a hash table to do the calculation efficiently. This information is useful if the provided routines don't quite fit what you need.
R limma package
The limma package does linear models explicitly for microarrays: lmFit(object, design)
  • The design matrix has rows corresponding to the microarrays and columns corresponding to coefficients of the model.
  • The object can be an array or an ExpressionSet -- but should have log ratios or log values of expressions)

Simple linear regression: 
 y_i = \beta_0 + \beta_1 x_i +\epsilon_i, \,

In R, we would write a model y ~ x. The intercept is implied. If we wanted no intercept we would write y ~ x - 1 or y ~ x + 0.

Example 1: The trees data set has columns Volume, Height and Girth. How well does Girth predict Volume?

fit1 <- lm(Volume ~ Girt, trees)

The output of the summary is:
lm(formula = Volume ~ Girth, data = trees)

   Min     1Q Median     3Q    Max 
-8.065 -3.107  0.152  3.495  9.587 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
Girth         5.0659     0.2474   20.48  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 29 degrees of freedom
Multiple R-squared:  0.9353, Adjusted R-squared:  0.9331 
F-statistic: 419.4 on 1 and 29 DF,  p-value: < 2.2e-16

The predicted line is Volume ~ 5.0659*Girth - 36.9535

Example 2: Plot the data and the predicted line on the same graph
tcoeff <- coefficients(fit1)
plot(Volume~Girth, data=trees)
abline(tcoeff[[1]], tcoeff[[2]])

Example 3: Look at the errors to make sure that somewhat uniform across the variables
errors <- trees$Volume - tcoeff[[1]]- tcoeff[[2]]*trees$Girth
plot(trees$Girth, errors)

Example 4: Evaluate whether log(Volume) ~ log(Girth) is a better model 

Example 5: Evaluate whether log(Volume) ~ log(Girth) + log(Height) is a better model

Example 6: ANOVA = Analysis of variance looks at the SS_total (sum squares) 

ANOVA Columns are: 
  • Df = degrees of freedom - number of independent pieces of information available to estimate a parameter
  • Sum sq = sum of the square of the variables (SS)
  • Mean sq = SS/DF for that row
The F values are determined by the ratio the MeanSq of the variable divided by the MeanSq of the error. This has the F distribution. The p value gives how likely this ratio would occur at random (no effect). The partition of the Sum of squares error is derived in http://en.wikipedia.org/wiki/Partition_of_sums_of_squares

Example 7: Suppose x = [1, 2, 3, 4] and y = [1, 3, 3, 4.2]. Find the linear model and the ESS (explained sum of squared errors).

The Limma package has its only modeling and design.