There are two useful functions for clustering: hclust (hierarchical clustering) and plclust (plots the cluster tree; in recent versions of R, use plot() on the hclust result instead). Given a matrix:
> m = matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol=2)
> m
     [,1] [,2]
[1,]    1    5
[2,]    2    4
[3,]    3    3
[4,]    4    2
[5,]    5    1
> hc = hclust(dist(m), method="average")
Here, "average" is how the distance between two clusters is computed: it is the mean of the distances over every possible pair (a, b), where a comes from one cluster and b from the other.
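To make the "average" rule concrete, here is a minimal sketch that computes it by hand for the matrix above. The split into rows 1-2 and rows 3-5 is an assumption made for illustration, and the names a, b, d and avg_link are mine, not part of hclust.

```r
m <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
a <- 1:2   # rows assumed to belong to the first cluster
b <- 3:5   # rows assumed to belong to the second cluster

# Pairwise Euclidean distances between all rows of m, as a full matrix.
d <- as.matrix(dist(m))

# Average linkage: mean distance over every pair with one point from each cluster.
avg_link <- mean(d[a, b])
avg_link

# For comparison: the height of the last merge reported by hclust.
hc <- hclust(dist(m), method = "average")
max(hc$height)
```

With this data the two numbers should agree, since the last merge joins exactly these two groups.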
That's the easy part. What's interesting is finding the cluster centroids. First, we decide on the number of clusters by looking at the tree drawn by plclust. Then,
> cutree(hc, 2)
[1] 1 1 2 2 2
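The choice of two clusters can be checked against the tree itself. The following is a rough sketch, not the only way to do it: it prints the merge heights and draws the dendrogram with plot(), which is used here on the assumption that plclust may not be available in your version of R. The value k = 2 is simply the one used above.

```r
m <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
hc <- hclust(dist(m), method = "average")

# Heights of the successive merges; a large jump hints at a natural cut.
hc$height

# Draw the tree and outline the clusters obtained for k = 2.
plot(hc)
rect.hclust(hc, k = 2)
```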
So the first two points, (1, 5) and (2, 4), form cluster 1, and (3, 3), (4, 2), (5, 1) form cluster 2. To compute the mean of each column separately for these two clusters, we first build a pair of grouping vectors like this:
> list(rep(cutree(hc, 2), ncol(m)), col(m))
[[1]]
 [1] 1 1 2 2 2 1 1 2 2 2

[[2]]
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
[4,]    1    2
[5,]    1    2

Here, [[1]] is 1 1 2 2 2 1 1 2 2 2. Laid out in the shape of m (filled column by column), it labels every cell with the cluster of its row:
1 1
1 1
2 2
2 2
2 2
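The same layout can be produced in R. This is a small sketch; cl is typed in by hand from the cutree output above, and cl_cells is a name I made up for illustration.

```r
m  <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
cl <- c(1, 1, 2, 2, 2)   # cluster labels, copied from cutree(hc, 2) above

# rep() repeats the row labels once per column, which matches the
# column-by-column order in which R stores the cells of m.
cl_cells <- matrix(rep(cl, ncol(m)), nrow = nrow(m))
cl_cells

# col(m) gives the column number of every cell, in the same shape as m.
col(m)
```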
But we also want to keep the two columns separate, which is what [[2]] is for. To see this easily, we call tapply without specifying a function, which returns the group index of each element:
> tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)))
 [1] 1 1 2 2 2 3 3 4 4 4
As we can see, the cells of matrix m will be grouped, and then averaged, like this:
1 3
1 3
2 4
2 4
2 4
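To see which values of m actually end up in each of the four groups, the index vector can be passed to split(). This is only a sketch; cl is typed in by hand from the cutree output, and idx and groups are illustrative names.

```r
m  <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
cl <- c(1, 1, 2, 2, 2)   # cluster labels, copied from cutree(hc, 2) above

# Group index of every cell (tapply called without a function).
idx <- tapply(m, list(rep(cl, ncol(m)), col(m)))

# The same index, reshaped to look like m.
matrix(idx, nrow = nrow(m))

# The values of m that fall into each group.
groups <- split(as.vector(m), idx)
groups
```

Groups 1 and 2 hold the first column for clusters 1 and 2, and groups 3 and 4 hold the second column.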
If this still looks weird, consider the following example from the R manual:
> ind <- list(c(1, 2, 2), c("A", "A", "B"))
> ind
[[1]]
[1] 1 2 2

[[2]]
[1] "A" "A" "B"

> table(ind)
     ind.2
ind.1 A B
    1 1 0
    2 1 1
> tapply(1:3, ind) #-> the split vector
[1] 1 2 4
> tapply(1:3, ind, sum)
  A  B
1 1 NA
2 2  3
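The split vector is just the position of each element's cell in the result table, counted column by column. Here is a sketch of that arithmetic for the two-factor case; f1, f2 and cell are names I introduced, and the formula is my reading of the output rather than something quoted from the manual.

```r
ind <- list(c(1, 2, 2), c("A", "A", "B"))

# Turn both grouping vectors into factors so levels can be counted.
f1 <- factor(ind[[1]])
f2 <- factor(ind[[2]])

# Cell position in column-major order: row level + (column level - 1) * number of rows.
cell <- as.integer(f1) + (as.integer(f2) - 1L) * nlevels(f1)
cell

# Compare with what tapply reports when no function is given.
tapply(1:3, ind)
```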
Finally, apply mean:
> tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)), mean)
    1   2
1 1.5 4.5
2 4.0 2.0
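If the two-vector grouping feels heavy, the same centroids can be obtained column by column. This is an alternative sketch, not the approach used above; cl is typed in by hand from the cutree output and centers is an illustrative name.

```r
m  <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
cl <- c(1, 1, 2, 2, 2)   # cluster labels, copied from cutree(hc, 2) above

# For each column of m, take the mean within each cluster.
# The result has one row per cluster and one column per column of m.
centers <- apply(m, 2, function(column) tapply(column, cl, mean))
centers
```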
Our centroids, therefore, are (1.5, 4.5) and (4.0, 2.0). Now we can run kmeans, using the output above as the initial centers.
> hm = tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)), mean)
> kmeans(m, hm)
K-means clustering with 2 clusters of sizes 2, 3

Cluster means:
  [,1] [,2]
1  1.5  4.5
2  4.0  2.0

Clustering vector:
[1] 1 1 2 2 2

Within cluster sum of squares by cluster:
[1] 1 4
 (between_SS / total_SS =  75.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"
Well, in this case, kmeans ended up with the same centers and the same cluster assignments as the ones it was started with from hclust.
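Rather than comparing the printed output by eye, the agreement can be checked directly. A small sketch, where same_clusters and same_centers are names I made up for the two checks:

```r
m  <- matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
hc <- hclust(dist(m), method = "average")
cl <- cutree(hc, 2)

# Centroids of the hclust clusters, used as starting centers for kmeans.
hm <- tapply(m, list(rep(cl, ncol(m)), col(m)), mean)
km <- kmeans(m, hm)

# Did kmeans keep every point in the cluster hclust assigned it to?
same_clusters <- all(km$cluster == cl)

# Did the centers stay where they started (up to rounding error)?
same_centers <- max(abs(km$centers - hm)) < 1e-8

same_clusters
same_centers
```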
References.
1. R manual on tapply.
2. 구자용, 박현진, 최대우, 김성수, Data Mining, KNOU Press.