Lecture 6: Populations and samples (continued)

Outline:
  1. Mutations
  2. Continue to work on R (data structures and combining)
  3. Emphasize vectorized or array operations.

Resources:
  • Elementary statistics with R:  http://www.r-tutor.com/elementary-statistics
  • Summary of R statistics functions covered in class: R statistics


Biology: Mutations 

Point mutations (replace one nucleotide with another, e.g., A <-> G):
  1. Silent mutations - code for the same (or a functionally similar) amino acid (no change in phenotype)
  2. Missense mutations - code for a different amino acid
  3. Nonsense mutations - code for a stop codon, which can truncate the protein

Insertions (add a few nucleotides) - may cause a frameshift or change how splicing occurs

Deletions (remove a few nucleotides) - may cause a frameshift

In addition to "small scale mutations" we can get "large scale mutations" such as:
  • gene duplications (multiple copies increase the dose)
  • large-scale deletions
  • chromosomal translocations - an interchange of segments between different chromosomes (not the homologous partner)
  • loss of heterozygosity (LOH) - loss of one copy of an entire gene and the surrounding region

Lawrence et al. (2014) paper:  
3M single nucleotide variations, 77K small insertions and deletions, 540K missense, 207K synonymous, 46K nonsense, 33K splice-site, and 2.3M noncoding mutations.

From last time: populations and samples (looking at how well the population mean, variance, and sd are approximated by sample values):

Example 1:  Generate 1000 samples of size 10 of N(0, 1). How well does each of the samples predict the population mean? (Look at the histogram of the sample means as well as the maximum and minimum values.)
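
A minimal sketch of Example 1, assuming one sample per row of a matrix. The seed and the names numSamples, sampleSize, samples, and sampleMeans are illustrative choices, not part of the original example.

```r
set.seed(6)                       # arbitrary seed, only so the run is repeatable
numSamples <- 1000
sampleSize <- 10

# One sample per row: 1000 rows, each holding 10 draws from N(0, 1)
samples <- matrix(rnorm(numSamples * sampleSize),
                  nrow = numSamples, ncol = sampleSize)

# rowMeans is vectorized: one mean per sample, no explicit loop
sampleMeans <- rowMeans(samples)

hist(sampleMeans, main = "Means of 1000 samples of size 10",
     xlab = "Sample mean")
print(range(sampleMeans))         # minimum and maximum sample mean
```

The spread of the histogram around 0 shows how far a single sample of size 10 can miss the population mean.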

Example 2:  What are the population and sample variances of the 10000 values of Example 1? (Calculate them using the vectorized form and compare with the sample variance computed with the var function. Display the histogram of the values.)
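
One possible way to work Example 2, assuming the 10000 values are regenerated here rather than reused from Example 1. The names sumSq, popVar, and sampVar are made up for this sketch.

```r
set.seed(6)                       # arbitrary seed, only so the run is repeatable
x <- rnorm(10000)                 # 1000 * 10 values from N(0, 1)
n <- length(x)

# Vectorized form: subtract the mean, square, and sum without a loop
sumSq <- sum((x - mean(x))^2)
popVar  <- sumSq / n              # population form (divide by n)
sampVar <- sumSq / (n - 1)        # sample form (divide by n - 1)

print(c(population = popVar, sample = sampVar, builtin = var(x)))
print(all.equal(sampVar, var(x))) # var() uses the n - 1 form

hist(x, main = "10000 values from N(0, 1)", xlab = "Value")
```

With n this large the two forms differ only slightly; the difference matters much more for samples of size 10.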

Example 3: What is the actual or population variance of the 1000 samples of size 10 from Example 1?

Example 4: Show the estimates of the mean as center points, with error bars whose wings are the standard deviations.  (Base R has no dedicated error-bar function.  There is an errbar function available in the Hmisc package:)

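install.packages("Hmisc", dependencies = TRUE)   # one-time install; spell out TRUE, since T can be reassigned
library(Hmisc)                                    # load the package in each session
help(errbar)                                      # documentation for errbar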
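
A hedged sketch of Example 4. It assumes only 20 samples are plotted (so the bars stay readable) and that the wings are one sample standard deviation on each side; both are choices made for this illustration. If Hmisc is not installed, the sketch falls back to the base-R arrows function, which is a substitute and not the errbar approach named in the example.

```r
set.seed(6)                                    # arbitrary seed
samples <- matrix(rnorm(20 * 10), nrow = 20)   # 20 samples of size 10 (assumed count)
sampleMeans <- rowMeans(samples)               # center points
sampleSDs <- apply(samples, 1, sd)             # wing lengths, one sd per sample
idx <- seq_along(sampleMeans)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # errbar(x, y, yplus, yminus): points at y, bars from yminus to yplus
  Hmisc::errbar(idx, sampleMeans,
                sampleMeans + sampleSDs, sampleMeans - sampleSDs,
                xlab = "Sample", ylab = "Mean +/- sd")
} else {
  # Base-R fallback: flat-headed arrows drawn in both directions
  plot(idx, sampleMeans, pch = 16,
       ylim = range(sampleMeans - sampleSDs, sampleMeans + sampleSDs),
       xlab = "Sample", ylab = "Mean +/- sd")
  arrows(idx, sampleMeans - sampleSDs, idx, sampleMeans + sampleSDs,
         angle = 90, code = 3, length = 0.03)
}
abline(h = 0, lty = 2)                         # true population mean
```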


What are vectorized operations and why use them?
  • Data languages such as R and MATLAB use tables as their primary data model.
  • Many functions and operators are overloaded to work on arrays:  A + B works whether A and B contain single values or arrays of the same size.
  • In a language such as Java or C, A + B on arrays would be written as a loop. This has two drawbacks: it obscures the structure and it can't be optimized very well.
  • Try to write code without using loops or ifs, for better performance, clarity, and bug detection.
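
To make the points above concrete, here is a small sketch comparing a loop with the vectorized form, and an if inside a loop with a logical mask. The vectors and names are invented for illustration.

```r
A <- c(1, 2, 3, 4)
B <- c(10, 20, 30, 40)

# Loop form: the element-by-element structure has to be spelled out
loopSum <- numeric(length(A))
for (i in seq_along(A)) {
  loopSum[i] <- A[i] + B[i]
}

# Vectorized form: one expression over the whole arrays
vecSum <- A + B
print(identical(loopSum, vecSum))   # TRUE

# Replacing "if" with vectorized selection
x <- c(-2, 5, -1, 3)
positiveOnly <- x[x > 0]            # logical mask keeps 5 and 3
clipped <- ifelse(x < 0, 0, x)      # negatives become 0: 0 5 0 3
print(positiveOnly)
print(clipped)
```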

Understanding and estimating the population mean from samples of a fixed size:
  • We can use the sample mean to estimate the population mean. (It may not be a great estimate if the sample size is small.)
  • We can get an idea of the variability of this type of estimate by looking at the distribution (histogram) of the means of a lot of samples of a fixed size:
    • The histogram is bell-shaped (sample means are normally distributed when the population is normal, and approximately so for large samples from other populations).
    • The mean of the collection of all the possible sample means is the true population mean. You won't ever have all of the possible samples, but if you have enough samples you can get a good estimate of the true population mean.
    • The standard deviation of the collection of all possible sample means is the standard error of the mean (SEM) = true population standard deviation / sqrt(sample size)
    • Unfortunately, we usually don't know the true population standard deviation needed in the SEM formula, so we use the sample standard deviation to approximate it.
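
A sketch of the SEM ideas above, assuming an N(0, 1) population (so the true standard deviation is 1) and 5000 simulated samples of size 10. The count of 5000 and all variable names are choices made for this illustration.

```r
set.seed(6)                                   # arbitrary seed
sampleSize <- 10
samples <- matrix(rnorm(5000 * sampleSize), nrow = 5000)
sampleMeans <- rowMeans(samples)

trueSEM <- 1 / sqrt(sampleSize)               # population sd / sqrt(sample size)
empiricalSEM <- sd(sampleMeans)               # sd of the simulated sample means

# In practice only one sample is available, so plug in its sample sd
oneSample <- samples[1, ]
estimatedSEM <- sd(oneSample) / sqrt(sampleSize)

print(c(true = trueSEM, empirical = empiricalSEM, estimated = estimatedSEM))
```

The empirical value should sit close to the true SEM, while the estimate from a single small sample can be noticeably off.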
Understanding and estimating the population variance and standard deviation from samples of a fixed size:
  • The sample variance (the one that divides by sample size - 1) can be used to estimate the variance of the population.
  • The sample standard deviation (the square root of that sample variance) can be used to estimate the standard deviation of the population.
  • We can get an idea of the variability of this type of estimate by looking at the distribution (histogram) of the variances of a lot of samples of fixed size.
    • The histogram is roughly bell-shaped, though for small samples it is skewed to the right because a variance cannot be negative.
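
A sketch of the variance estimates described above, assuming 1000 samples of size 10 from N(0, 1), where the true variance is 1. It computes every per-sample variance in vectorized form and compares the n - 1 divisor with the n divisor; the names are illustrative.

```r
set.seed(6)                                   # arbitrary seed
sampleSize <- 10
samples <- matrix(rnorm(1000 * sampleSize), nrow = 1000)

# Subtracting rowMeans recycles down the columns, so each row is
# centered on its own mean
centered <- samples - rowMeans(samples)
sumSq <- rowSums(centered^2)

sampleVars  <- sumSq / (sampleSize - 1)       # sample form, as var() computes
popFormVars <- sumSq / sampleSize             # population form applied to a sample

# Averaged over many samples, the n - 1 form lands near the true variance,
# while the n form comes out systematically lower
print(c(meanSampleVar = mean(sampleVars), meanPopFormVar = mean(popFormVars)))

hist(sampleVars, main = "Variances of 1000 samples of size 10",
     xlab = "Sample variance")
```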