Lecture 24: Clustering

Resources:
Clustering resources in R (CRAN Task View: Cluster Analysis & Finite Mixture Models): http://cran.cnr.berkeley.edu/web/views/Cluster.html
Wikipedia K-means article: http://en.wikipedia.org/wiki/K-means_clustering
Wikipedia Hierarchical clustering article: http://en.wikipedia.org/wiki/Hierarchical_clustering



Basic clustering strategies:
  • Partitioning - divide the data points into groups based on a criterion such as closeness to a group center (k-means)
  • Hierarchical - start by merging the closest points and build up clusters based on distance (bottom-up, or agglomerative); there is also a top-down (divisive) approach.
  • Modeling - try to fit the best model (e.g., a mixture of Gaussians) to the data
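For the model-based approach, a minimal sketch using the mclust package from CRAN (an assumption for illustration; mclust is not used elsewhere in this lecture). Mclust fits Gaussian mixture models and picks the number of components by BIC:
library(mclust)                     # install.packages("mclust") if needed
fit <- Mclust(x)                    # x: a numeric matrix, e.g. the data generated in Example 2 below
summary(fit)                        # chosen model and number of mixture components
plot(fit, what = "classification")  # points colored by fitted component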

K-means clustering:
  1. Fix the number of clusters in advance
  2. Assign each data point to a cluster at random (random partition method)
  3. Recompute the cluster centers as the average (centroid) of the points in each cluster.
  4. Reassign the data points to the cluster with the closest centroid.
  5. Repeat steps 3 and 4 a fixed number of times or until the cluster assignments no longer change.
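A minimal sketch of steps 1-5 in R (a hand-rolled version for illustration, assuming Euclidean distance; the name my_kmeans is made up, and empty clusters are not handled):
my_kmeans <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  clust <- sample(rep(1:k, length.out = nrow(x)))  # step 2: random partition
  for (i in 1:iters) {
    # step 3: recompute each center as the centroid of its cluster's points
    centers <- t(sapply(1:k, function(j) colMeans(x[clust == j, , drop = FALSE])))
    # step 4: reassign each point to the cluster with the closest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]  # point-to-centroid distances
    newclust <- apply(d, 1, which.min)
    if (all(newclust == clust)) break                     # step 5: stop when nothing changes
    clust <- newclust
  }
  list(cluster = clust, centers = centers)
}
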
Subtleties:
  • Most of the time it is not obvious how many clusters to use, so people often try several choices and look for the best result.
  • Because of the random initialization, the results may change each time you run the algorithm. Typically you repeat the clustering many times, say 100, and choose the run with the smallest within-cluster sum of squared distances to the centroids (in R, the nstart argument to kmeans does this for you).
  • Finding the optimal k-means clustering is NP-hard, so the algorithm is not guaranteed to reach a global optimum.
Example 1: Go through the k-means clustering algorithm by hand in one dimension with the points 1, 2, 3, 4, 6, 8, using 2 clusters and two different initial assignments of the points.
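One possible trace (assuming ties are broken by leaving the point in its current cluster): start with A = {1, 2}, B = {3, 4, 6, 8}, so the centroids are 1.5 and 5.25; reassigning moves 3 into A (distance 1.5 vs. 2.25), giving A = {1, 2, 3}, B = {4, 6, 8} with centroids 2 and 6; reassigning again changes nothing (4 is tied at distance 2 from both centroids), so the algorithm has converged.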

Example 2: Generate two distinct clusters of 50 two-dimensional vectors (Gaussian centered at (0, 0) and Gaussian centered at (1, 1), each with sd 0.3) and plot them.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),            # 50 points around (0, 0)
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))  # 50 points around (1, 1)
plot(x[,1], x[,2])
points(0, 0, pch = "+", col = "blue", cex = 3)   # true center of first cluster
points(1, 1, pch = "+", col = "green", cex = 3)  # true center of second cluster

Example 3: Cluster the data from Example 2 using k-means with two clusters. Plot the results and look at the return values.
nclusts <- 2
cl <- kmeans(x, nclusts)                   # partition x into nclusts clusters
plot(x, col = cl$cluster)                  # color each point by its assigned cluster
points(cl$centers, col = 1:nclusts, pch = "x", cex = 3)  # fitted centroids
points(0, 0, pch = "+", col = "blue", cex = 3)   # true centers, for comparison
points(1, 1, pch = "+", col = "green", cex = 3)

Now adjust the code to try it with 3 clusters and with 4 clusters.
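One possible way to do this side by side (the par layout is just one choice; nstart = 25 repeats the random start as discussed above):
op <- par(mfrow = c(1, 2))   # two plots side by side
for (k in 3:4) {
  clk <- kmeans(x, centers = k, nstart = 25)
  plot(x, col = clk$cluster, main = paste(k, "clusters"))
  points(clk$centers, col = 1:k, pch = "x", cex = 3)
}
par(op)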

Example 4: Look at the change in within-cluster variance as a function of the number of clusters. Look for an "elbow" where adding more clusters stops reducing the within-group sum of squares by much.
wss <- rep(0, 15)
wss[1] <- (nrow(x)-1)*sum(apply(x, 2, var))       # k = 1: total sum of squares about the overall mean
for (k in 2:15) {
  wss[k] <- sum(kmeans(x, centers = k)$withinss)  # total within-cluster sum of squares
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
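Because of the random starts, this curve can change from run to run. A sketch of one way to stabilize it (set.seed and nstart are standard, though the values 1 and 25 are arbitrary):
set.seed(1)   # make the runs reproducible
wss <- sapply(1:15, function(k) sum(kmeans(x, centers = k, nstart = 25)$withinss))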


Hierarchical clustering:
Uses agglomeration to successively merge clusters.
  1. Initialization: Calculate the distance between all pairs of points. The initial cluster list is just the single points. Set the level L(0) = 0 and the merge counter m = 0.
  2. Update step:
    • Find the pair of clusters that are closest (depends on which linkage method you choose). 
    • Merge the clusters into a new cluster and add to cluster list. 
    • Increment m and set L(m) to the cluster linkage distance.
    • Delete the rows and columns of the distance matrix corresponding to the two merged clusters, and add a row and column for the new cluster.
  3. If more than one cluster remains, repeat step 2.
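hclust stores exactly these quantities: the merge history from step 2 and the levels L(m). A quick look, using the same six points as in the k-means Example 1 (hclust defaults to complete linkage):
hc <- hclust(dist(c(1, 2, 3, 4, 6, 8)))
hc$merge    # step-by-step merge history (negative entries are original points)
hc$height   # L(m): the linkage distance at each merge
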
Linkage methods:
  • Complete - distance between clusters A and B is the maximum distance between a point from A and a point from B (tends to produce compact clusters)
  • Single - distance between clusters A and B is the minimum distance between a point from A and a point from B (friend-of-friends; tends to produce long, chained clusters)
  • Average - distance between clusters A and B is the average of the distances between points in A and points in B
  • Ward - distance between clusters A and B is the increase in within-cluster variance if the two are merged (tends to produce compact, spherical clusters)
  • Centroid - distance between clusters A and B is the distance between the centroids of A and B
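A quick way to compare the linkage methods on the same data (a sketch, reusing the two-dimensional x from the k-means examples; "ward.D" is the name for Ward's method in R 3.1.0 and later):
op <- par(mfrow = c(2, 2))   # four dendrograms in one figure
for (m in c("complete", "single", "average", "ward.D")) {
  plot(hclust(dist(x), method = m), main = m, labels = FALSE)
}
par(op)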

Example 1: Simple example (with no names set, the leaf labels are just the row numbers)
x <- c(1, 2, 3, 4, 6, 8)
hc0 <- hclust(dist(x), "complete")
plot(hc0)

Example 2: Simple example with row names
names(x) <- as.character(x)
hc1 <- hclust(dist(x), "complete")
plot(hc1)

Example 3: Simple example using single linkage, with hang = -1 so the labels hang down from the baseline
hc2 <- hclust(dist(x), "single")
plot(hc2, hang = -1)
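With single linkage the merge heights here are 1, 1, 1, 2, 2: the points 1 through 4 chain together at distance 1, then 6 and finally 8 join at distance 2, which is why this dendrogram looks so different from the complete-linkage one.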

Example 4: More complicated example from the hclust manual (notice that the distances are squared, which Ward's method expects)
require(graphics)
hc3 <- hclust(dist(USArrests)^2, "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0
plot(hc3, hang = -1)
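To turn the dendrogram into actual cluster assignments, a sketch using cutree (cutting hc3 into 4 groups is an arbitrary choice for illustration):
groups <- cutree(hc3, k = 4)              # cluster membership for each state
table(groups)                             # cluster sizes
rect.hclust(hc3, k = 4, border = "red")   # outline the 4 clusters on the dendrogram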