Lecture 5: Populations and samples

Goals:
  • Understand the relationship between DNA, the gene, and the gene product.
  • Understand the measures of dispersion and why sample and population calculations of these are different.
Outline:
  1. Structure of DNA and genes (with a visit to NCBI website)
  2. Continue to work on R (data structures and combining)
  3. Measures of dispersion: standard deviation, median absolute deviation, interquartile range.
Resources:
       Elementary statistics with R:  http://www.r-tutor.com/elementary-statistics
       Summary of R statistics functions covered in class: R statistics


Biology: Details of transcription

Schematic of RNA transcription
source: http://en.wikipedia.org/wiki/File:MRNA.svg

Details of translation:
Translation



Measures of dispersion

Example 1: Measures of dispersion: compare error, median absolute deviation (mad), mean squared error (variance), and root mean squared error (standard deviation) in estimating the data set points by the mean. (Do a simple example by hand using data 4, -1, 3, 0, -1.) 
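
A minimal sketch of the hand computation, written as vectorized R so each step can be checked. It uses the population forms (divide by n); the variable names are illustrative and not part of the original notes.

```r
# Data from the hand example
x <- c(4, -1, 3, 0, -1)
n <- length(x)

m   <- sum(x) / n        # mean: 5 / 5 = 1
err <- x - m             # errors about the mean: 3 -2 2 -1 -2
sum(err)                 # 0: raw errors cancel, so they cannot measure spread

mse  <- sum(err^2) / n   # mean squared error (population variance): 22 / 5 = 4.4
rmse <- sqrt(mse)        # root mean squared error (population sd), about 2.098

# Unscaled median absolute deviation: median of |x - median(x)|
# median(x) is 0, so this is the median of 4 1 3 0 1, which is 1
mad_raw <- median(abs(x - median(x)))
```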

Example 2: Compute measures of spread for the simple example using R (var, sd, IQR, mad)
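
One way to run this in R is sketched below. Note that var and sd divide by n - 1, and mad applies a default scale constant, so the results differ from a population (divide by n) hand computation. The rescaling at the end is an illustration, not a built-in option.

```r
x <- c(4, -1, 3, 0, -1)
n <- length(x)

var(x)   # 5.5: sum of squared errors 22 divided by (n - 1)
sd(x)    # sqrt(5.5), about 2.345
IQR(x)   # 4: 75th percentile (3) minus 25th percentile (-1), default quantile type
mad(x)   # 1.4826: unscaled MAD of 1 times the default constant 1.4826

# Rescale the sample variance to get the population (divide by n) value
pop_var <- var(x) * (n - 1) / n   # 4.4
pop_sd  <- sqrt(pop_var)          # about 2.098
```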

Example 3: Compare measures of spread for the faithful data. Also display using a histogram (hist) and boxplot (boxplot).
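
A possible starting point, assuming the eruptions column of the built-in faithful data frame is the variable of interest (the waiting column could be treated the same way):

```r
# faithful ships with R; it has the columns eruptions and waiting
e <- faithful$eruptions

# Collect the four measures of spread in one named vector
spread <- c(var = var(e), sd = sd(e), IQR = IQR(e), mad = mad(e))
print(spread)

# The histogram shows the shape; the boxplot shows the median and quartiles
hist(e, main = "Old Faithful eruption durations", xlab = "Duration (min)")
boxplot(e, horizontal = TRUE, xlab = "Duration (min)")
```

The eruption durations are bimodal, so the different measures of spread may tell noticeably different stories here.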

The hand computation is the actual or population value.  What happens when you only have a sample of the data?

Probability distributions

Example 4: Use help(Distributions) to see what probability distribution functions are available in R.  For example, one common distribution is the normal distribution. What do the four forms of the function mean?
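
For the normal distribution the four forms are dnorm, pnorm, qnorm and rnorm. A short sketch of what each returns for N(0, 1) (the seed is arbitrary and only makes the random draws repeatable):

```r
dnorm(0)       # d = density: height of the curve at 0, 1/sqrt(2*pi), about 0.399
pnorm(1.96)    # p = cumulative probability P(X <= 1.96), about 0.975
qnorm(0.975)   # q = quantile: the inverse of pnorm, about 1.96

set.seed(1)    # arbitrary seed
rnorm(3)       # r = random: three draws from N(0, 1)
```

Other distributions follow the same naming pattern, for example dbinom, pbinom, qbinom and rbinom.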

Example 5: Generate a sample of size 10,000 from N(0, 1).  What are the actual mean, variance and standard deviation of these values?  How well do the calculations match the original distribution?
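
A sketch of this example, treating the 10,000 generated values as the whole population so that the variance divides by n. The seed value and variable names are arbitrary choices.

```r
set.seed(5)                          # arbitrary seed so the run is repeatable
z <- rnorm(10000, mean = 0, sd = 1)
n <- length(z)

actual_mean <- sum(z) / n
actual_var  <- sum((z - actual_mean)^2) / n   # divide by n, not n - 1
actual_sd   <- sqrt(actual_var)

# Compare with the distribution's parameters: mean 0, variance 1, sd 1
c(mean = actual_mean, var = actual_var, sd = actual_sd)
```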

Example 6:  Look at the values generated in Example 5 as 1000 samples of size 10.  How well does each of the samples predict the population mean? (Look at the histogram of the sample means as well as the maximum and minimum values.)
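
One way to split the values into samples is to reshape them into a matrix with one sample per column. This is a sketch; the seed and the column-wise layout are assumptions made for illustration.

```r
set.seed(5)          # arbitrary seed
z <- rnorm(10000)

# Each of the 1000 columns is one sample of size 10
samples <- matrix(z, nrow = 10, ncol = 1000)
sample_means <- colMeans(samples)

range(sample_means)  # minimum and maximum of the 1000 estimates
hist(sample_means, main = "Means of 1000 samples of size 10", xlab = "Sample mean")
```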

Example 7:  What is the actual variance of the sample of size 10,000 from Example 5? (Calculate using the vectorized form and compare with the sample variance computed with the var function. Display the histogram of values.)
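
A sketch of the comparison, regenerating the values so the block runs on its own (the seed is arbitrary):

```r
set.seed(5)
z <- rnorm(10000)
n <- length(z)

actual_var <- sum((z - mean(z))^2) / n   # vectorized form, divides by n
sample_var <- var(z)                     # var divides by n - 1

actual_var / sample_var                  # equals (n - 1) / n = 0.9999
hist(z, main = "10,000 draws from N(0, 1)", xlab = "Value")
```

With n this large the two denominators barely differ, so the two variances nearly coincide.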

Example 8: What is the actual variance of the 1000 samples of size 10?  What is the sample variance of each of the samples of size 10 from Example 6?  
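
With only 10 values per sample the choice of denominator matters. The sketch below computes both versions for every sample and averages them; the seed and matrix layout are assumptions for illustration.

```r
set.seed(5)                                    # arbitrary seed
samples <- matrix(rnorm(10000), nrow = 10, ncol = 1000)

sample_vars <- apply(samples, 2, var)          # n - 1 denominator, one value per sample
biased_vars <- sample_vars * (10 - 1) / 10     # n denominator for the same samples

mean(sample_vars)   # should land near the population variance of 1
mean(biased_vars)   # should land near 0.9, an underestimate
hist(sample_vars, main = "Sample variances, n = 10", xlab = "Variance")
```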

Example 9: Plot the estimates of the mean as center points, with error bars whose wings are the standard deviations.  (R doesn't come with error bars built in.  There is an errbar function available in the Hmisc package:)

install.packages("Hmisc", dependencies=TRUE)

library('Hmisc')

help(errbar)
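
A sketch of how errbar might be called for this example. It assumes Hmisc is installed (the plot is skipped otherwise), and plotting only the first 20 samples is an arbitrary choice to keep the figure readable.

```r
set.seed(5)                                   # arbitrary seed
samples <- matrix(rnorm(10000), nrow = 10, ncol = 1000)

k <- 20                                       # number of samples to plot
m <- colMeans(samples[, 1:k])                 # center points
s <- apply(samples[, 1:k], 2, sd)             # wing length: sample sd

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # errbar(x, y, yplus, yminus): points at (x, y), bars from yminus to yplus
  Hmisc::errbar(1:k, m, m + s, m - s, xlab = "Sample", ylab = "Mean +/- sd")
  abline(h = 0, lty = 2)                      # population mean of N(0, 1)
}
```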



Key points:
  • A sample (a randomly drawn subset) can be used to estimate properties of the entire population. The larger the sample, the better the estimate.
  • The mean of the sample is used to estimate the mean of the population. This is called a point estimate of the population mean.
  • The unbiased estimator of the variance (the version with n-1 rather than n in the denominator) is used to estimate the population variance from the sample; its square root is used to estimate the population standard deviation.
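  • The sample MAD is multiplied by a scale factor which depends on how the numbers are distributed. R's mad function uses 1.4826 by default, which makes the MAD comparable to the standard deviation for normally distributed data.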
  • If we take many samples, each of size N, and look at the distribution of the estimates of the population mean calculated from these samples, we get an approximately normal distribution centered on the true population mean. The standard deviation of this distribution of estimates is the true population standard deviation divided by the square root of N.  This is called the SEM or standard error of the mean.  It is widely used in biology for error bars. Since we usually don't know the true population standard deviation, we use the sample-based estimate of the sd (the n-1 version) instead.
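
The SEM relationship in the last point can be checked by simulation. This is a sketch assuming 1000 samples of size 10 from N(0, 1); the seed and variable names are arbitrary.

```r
set.seed(5)
N <- 10
samples <- matrix(rnorm(N * 1000), nrow = N)   # one sample of size N per column

sample_means <- colMeans(samples)
sd(sample_means)       # spread of the estimates; theory gives 1 / sqrt(10), about 0.316

# In practice only one sample is available, so the SEM is estimated from it
first <- samples[, 1]
sem_first <- sd(first) / sqrt(N)   # sd uses the n - 1 denominator
sem_first
```

The single-sample estimate will vary from sample to sample, but it targets the same quantity as the spread of the 1000 means.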