Clustering in R and using tapply for finding centroids

There are two useful functions for hierarchical clustering: hclust (which builds the cluster tree) and plclust (which plots the resulting dendrogram). Given a matrix:

> m = matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol=2)
> m
     [,1] [,2]
[1,]    1    5
[2,]    2    4
[3,]    3    3
[4,]    4    2
[5,]    5    1
> hc = hclust(dist(m), method="average")

Here, "average" specifies how the distance between clusters is computed: with average linkage, the distance between two clusters is the average of the distances over every possible pair (a, b), where a and b come from the two different clusters.
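To make "average" linkage concrete, here is a quick sanity check (a sketch, not part of the original walkthrough; dm is just a throwaway name): the height of the final merge should equal the mean of all pairwise distances between the two clusters it joins, which here are rows 1-2 and rows 3-5, as the cutree output below confirms.

dm <- as.matrix(dist(m))  # 5 x 5 matrix of pairwise Euclidean distances
mean(dm[1:2, 3:5])        # mean of the 2 x 3 cross-cluster distances
tail(hc$height, 1)        # height of the last merge; should be the same value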

That's the easy part. What's more interesting is finding the cluster centroids. The first step is to decide on the number of clusters by inspecting the dendrogram with plclust.
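Here is a minimal plotting sketch, assuming the hc object built above; plot() on an hclust object draws essentially the same dendrogram as plclust(), and rect.hclust() outlines the clusters a given cut would produce.

plot(hc)                # dendrogram of the five rows of m
rect.hclust(hc, k = 2)  # boxes around the two clusters we are about to cut

Having settled on two clusters, we cut the tree: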

> cutree(hc, 2)
[1] 1 1 2 2 2

So, the first two rows, (1, 5) and (2, 4), form cluster 1, and (3, 3), (4, 2), and (5, 1) form cluster 2. To compute the averages separately for these two clusters, we first build a split vector like this:

> list(rep(cutree(hc, 2), ncol(m)), col(m))
[[1]]
 [1] 1 1 2 2 2 1 1 2 2 2

[[2]]
     [,1] [,2]
[1,]    1    2
[2,]    1    2
[3,]    1    2
[4,]    1    2
[5,]    1    2

Here, [[1]] is 1 1 2 2 2 1 1 2 2 2, which assigns the elements of m (in column order) to clusters like this:

1  1
1  1
2  2
2  2
2  2

But we also want to keep the two columns separate, which is what [[2]] is for. To see how the two indices combine, we call tapply without specifying a function:

> tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)))
 [1] 1 1 2 2 2 3 3 4 4 4

As we can see, the elements of matrix m fall into groups like this:

1  3
1  3
2  4
2  4
2  4
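To see the same grouping in R (a sketch reusing the objects above; grp is just a helper name), the group indices can be reshaped to the layout of m, and split() pulls out the four groups whose means will become the centroid coordinates:

grp <- tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)))  # the split vector
matrix(grp, nrow = nrow(m))  # same layout as m: the group index of each cell
split(m, grp)                # the four groups; their means give the centroid coordinates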

If this still looks weird, consider the following example from the R manual:

> ind <- list(c(1, 2, 2), c("A", "A", "B"))
> ind
[[1]]
[1] 1 2 2

[[2]]
[1] "A" "A" "B"

> table(ind)
     ind.2
ind.1 A B
    1 1 0
    2 1 1
> tapply(1:3, ind) #-> the split vector
[1] 1 2 4
> tapply(1:3, ind, sum)
  A  B
1 1 NA
2 2  3
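The split vector is just the cell number of each observation in the 2 x 2 table above, with the first index varying fastest; cell 3, the (1, B) combination, is empty, which is why the sum table shows NA there. As a sketch (interaction() is not part of the manual example, but its default level ordering should reproduce the same cell numbers):

# cells are numbered (1,A)=1, (2,A)=2, (1,B)=3, (2,B)=4; nothing falls into cell 3
as.integer(interaction(ind[[1]], ind[[2]]))  # 1 2 4, matching tapply(1:3, ind)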

Finally, apply mean:

> tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)), mean)
    1   2
1 1.5 4.5
2 4.0 2.0

Our centroids, therefore, are (1.5, 4.5) and (4.0, 2.0).
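As a cross-check (a sketch, not part of the original walkthrough), aggregate() computes the same per-cluster column means more directly:

aggregate(m, by = list(cluster = cutree(hc, 2)), FUN = mean)
# should report cluster means (1.5, 4.5) and (4.0, 2.0), matching the tapply result

Now we can run kmeans, seeding it with the centers found above: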

> hm = tapply(m, list(rep(cutree(hc, 2), ncol(m)), col(m)), mean)
> kmeans(m, hm)
K-means clustering with 2 clusters of sizes 2, 3

Cluster means:
  [,1] [,2]
1  1.5  4.5
2  4.0  2.0

Clustering vector:
[1] 1 1 2 2 2

Within cluster sum of squares by cluster:
[1] 1 4
 (between_SS / total_SS =  75.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"
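Those components can be pulled from the returned object; for example, a quick sketch (km0 is just a helper name) that recovers the 75.0 % printed above:

km0 <- kmeans(m, hm)       # same call as above, but keep the result
km0$betweenss / km0$totss  # 15 / 20 = 0.75, i.e. the 75.0 % shown in the printout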

In this case, kmeans started from the centers found by hclust settles on exactly the same clustering, so its output tells us nothing we did not already know from hclust.
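For comparison, here is a sketch of letting kmeans pick random starting centers instead (km is just a helper name); because the start is random, the labels, and occasionally the partition itself, can differ between runs, which is exactly why seeding kmeans with hclust centroids can be handy:

set.seed(1)                   # random starts differ from run to run
km <- kmeans(m, centers = 2)  # two distinct rows of m are chosen as initial centers
km$centers                    # compare with the centroids from hclust + tapply
km$cluster                    # compare with cutree(hc, 2); labels may be swapped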

References
1. R manual on tapply.
2. 구자용, 박현진, 최대우, 김성수, Data Mining, KNOU Press.