On Dangers of Indiscriminate Clustering

K-means and other algorithms for clustering are widely used for determining associations based upon standardized series of attributes. While there are a number of clustering algorithms, including some robust ones, some not-so-robust, some non-parametric, and some model-based, all share two problems.

First. it is necessary to estimate the number of clusters to be found. There are a number of devices for doing this, including examining the “scree plot” of the sums of square errors as additional components are added, orlooking at the singular values from an SVD (singular value decomposition) or other matrix decomposition of an associations or incidence matrix, or covariance matrix of the incidence matrix. However, finding a point in the scree which corresponds to the number of significant components is problematic. Professor Bill Press covers this somewhere in his Opinionated Lessons in Statistics series, but a formal way of determining this uses the nice algorithm by Zhu and Ghodsi, “Automatic dimensionality selection from the scree plot via the use of profile likelihood”, which appeared in Computational Statistics and Data Analysis in 2006, pp 918-930.

Second, even if clusters are found, there need to be criteria for determining whether these are real or not, or if this is just wishful thinking. It’s possible to go test these against the real world, but there would hope to be some better way, using data in hand. This is where model validation comes in, and I’ll have something to say about that in later posts.

Here, however, I want to illustrate what can happen if this kind of thing is not done.

Consider 100 hotels and 100 travelers. Having more users would be more realistic, but plots would get very dense and not illustrate the point as well. We want to determine clusters of associations between travelers and hotels. Perhaps we want to market to them. That’s in reality.

Here, however, I’m going to play a game. I’m going to say that the number of hotels a single traveler reports they stay at is governed by one plus the rounded value of a random Weibull distribution with shape 1.5 and scale 4. This gives us a nice and plausible distribution of stays by travelers in hotels. For example,

Beyond that, however, I’m going to give each and every hotel an equal chance of being chosen by a traveler so that in actuality there is no preference. What’s done next is to simulate the 100 travelers according to this, generating a bunch of transactions, and then clustering, first by estimating a number of clusters using Zhu and Ghodsi’s algorithm, and then using K-Means.

What happens?

This is the result of one run, but you can see … We have an estimate for a number of clusters, and we have clusters, and the clusters even look plausible. Presumably if actual travelers were followed, they would not be consistent with these clusters but the question is, how do we know whether or not their behavior is clustering about some real associations pattern, or is it just some other set of random associations.

We’ll come back to look at this at a later time.

Meanwhile, below is a PDF link to a set of clusters from 10 such simulations, and a PDF of the R code that was used to generate it.

Caution is advised!


(Slightly edited to provide better code on Saturday, 1st December 2012.)

# Hypothetical association of travelers and hotels, illustrating
# Jan Galkowski, empirical_bayesian@ieee.org
# 29th November 2012, last changed 1st December 2012

# (1) Assume 100 travelers and 100 hotels.
# (2) Assume number of distinct hotels per user is given by one plus a rounded Weibull distribution with shape 1.5 and scale 4.
# (3) Assume each and every hotel has an equal chance of participating in each user's selections.
# (4) Plan: (a) Generate a population according to this scheme;
# (b) use "kmeans" to cluster k-means, with random initial centers, Euclidean distance;
# (c) determine number of centers
# (d) present clustering.


NGenerations<- 10
NTravelers<- 100
NHotels<- 100
NumberOfClusters<- 8

pause.with.message<- function (message)
# Pause with a message between plots
cat("Paused. Press to continue...")

prOn<- function(fb, fx, w=11, h=8.5, basettl="", ttl="", onefile=TRUE)
# This directs all plot output to a PDF file set up in landscape orientation.
# The parameters fb and fx are two components of the PDF filename.
# The parameters basettl and ttl are two parts of a title on the PDF.
# Note if the same file is open in Acrobat or, pathologically, another R session,
# This will fail.
file<- sprintf("%s_%s", fb, fx)
fn<- sprintf("%s.pdf", file)
composite.ttl<- sprintf("%s %s", basettl, ttl)
pdf(file=fn, width=w, height=h, onefile=onefile, family="Helvetica", title=composite.ttl,
paper="special", bg="white", compress=TRUE)
cat(sprintf("Printing '%s' to PDF: '%s' ...\n", composite.ttl, fn))

prOff<- function()
# After a "prOn" is done, this closes the fie.

GaussianModel<- function(d, mu, sigma) dnorm(x=d, mean=mu, sd=sigma, log=TRUE)

GammaModel<- function(d, mu, sigma)
theta<- sigma*sigma/mu
k<- mu/theta
return(dgamma(x=d, shape=k, scale=theta, log=TRUE))

BetaModel<- function(d, mu, sigma)
v<- sigma*sigma
stopifnot( 0 != mu )
stopifnot( 0 != (mu - 1) * mu * (mu*mu - mu + v) )
nu<- (mu*mu - mu + v)/v
alpha<- -mu*nu
beta<- (mu - 1)*nu
print(c(alpha, beta))
return(dbeta(x=d, shape1=alpha, shape2=beta, log=TRUE))

rankDiscovery<- function(x.given, plotting=TRUE, ttl="", columns=NULL, model=GaussianModel)
# Rank discovery based upon the algorithm of Zhu and Ghodsi, "Automatic dimensionality
# selection from the scree plot via the use of profile likelihood", COMPUTATIONAL STATISTICS
# AND DATA ANALYSIS, 51 (2006), 918-930.
stopifnot( is.matrix(x.given) )
if (is.null(columns))
x<- x.given
x<- x.given[,columns]
n<- nrow(x)
m<- ncol(x)
decomposition<- svd(x)
singular.values<- decomposition$d
stopifnot( !is.unsorted(rev(singular.values)) )
p<- length(singular.values)
# print(singular.values)
p_1= p)
return(list(choice=NA, likelihood=NA, likelihoods=NA, singular.values=singular.values))
stopifnot( 2 < p )
likelihoods<- rep(x=0, times=p)
mu.1<- singular.values[1]
mu.2<- sum(singular.values[2:p])/(p-1)
sigma<- sqrt((p - 2)*var(singular.values[2:p])/(p - 2))
likelihood.1<- sum(sapply(X=singular.values[1], FUN=function(d) model(d, mu.1, sigma)))
likelihood.2<- sum(sapply(X=singular.values[2:p], FUN=function(d) model(d, mu.2, sigma)))
likelihoods[1]<- likelihood.1+likelihood.2
for (k in (3:p_1))
q<- k-1
i<- 1:q
j<- (1+q):p
mu.1<- sum(singular.values[i])/q
mu.2<- sum(singular.values[j])/(p-q)
sigma<- sqrt(((q - 1)*var(singular.values[i]) + (p - q - 1)*var(singular.values[j]))/(p - 2))
likelihood.1<- sum(sapply(X=singular.values[i], FUN=function(d) model(d, mu.1, sigma)))
likelihood.2<- sum(sapply(X=singular.values[j], FUN=function(d) model(d, mu.2, sigma)))
likelihoods[q]<- likelihood.1+likelihood.2
mu.1<- sum(singular.values[1:p_1])/p_1
mu.2<- singular.values[p]
sigma<- sqrt(var(singular.values[1:p_1]))
likelihood.1<- sum(sapply(X=singular.values[1:p_1], FUN=function(d) model(d, mu.1, sigma)))
likelihood.2<- sum(sapply(X=singular.values[p], FUN=function(d) model(d, mu.2, sigma)))
likelihoods[p_1]<- likelihood.1+likelihood.2
mu.1<- mean(singular.values)
sigma<- sd(singular.values)
likelihood.1<- sum(sapply(X=singular.values, FUN=function(d) model(d, mu.1, sigma)))
likelihoods[p]<- likelihood.1
choice<- which.max(likelihoods)
ratio.singular.values<- singular.values/singular.values[1]
if (plotting)
plot(1:p, ratio.singular.values, pch=22, col="blue", type="p", cex=2, main=sprintf("Scree Plot%s", ttl),
xlab="Dimension", ylab="Ratio of Singular Values to Largest", xaxt="n")
axis(side=1, at=1:p, cex.axis=0.7)
abline(v=choice, lwd=2, col="red", lty=6)
pause.with.message("Scree plot")
plot(1:p, likelihoods, pch=21, col="purple", type="l", lwd=2, lty=1, main=sprintf("Profile Log-LikelihoodR%s", ttl),
xlab="Dimension", ylab="Log-Likelihood", xaxt="n")
axis(side=1, at=1:p, cex.axis=0.7)
abline(v=choice, lwd=2, col="red", lty=6)
pause.with.message("Profile Log-Likelihood")
return(list(choice=choice, likelihood=likelihoods[choice], likelihoods=likelihoods, singular.values=singular.values))

asAssociationsMatrix<- function(transactions)
stopifnot( is.matrix(transactions) )
n<- nrow(transactions)
stopifnot( 2 == ncol(transactions) )
travelers<- sort(unique(transactions[,1]))
hotels<- sort(unique(transactions[,2]))
n.travelers<- max(transactions[,1])
n.hotels<- max(transactions[,2])
associations<- matrix(0, nrow=n.travelers, ncol=n.hotels)
for (k in (1:n))
associations[transactions[k,1], transactions[k,2]]<- 1

if (FALSE)

prOn(fb="Clusters", fx="FromNoise", basettl="Clusters can be found in noise", ttl="")

for (kGen in (1:NGenerations))
HotelsPerTraveler<- 1+round(rweibull(n=NTravelers, shape=1.5, scale=4))
NAssociations<- sum(HotelsPerTraveler)
TravelersAgainstHotels<- matrix(0, nrow=NAssociations, ncol=2)
j<- 0
for (k in (1:NTravelers))
n<- HotelsPerTraveler[k]
# Better sampling scheme.
TravelersAgainstHotels[((1+j):(j+n)),2]<- sample.int(n=NHotels, size=n, replace=TRUE, prob=rep(x=1/NHotels, times=NHotels))
TravelersAgainstHotels[((1+j):(j+n)),1]<- rep(x=k, times=n)
j<- n+j
colnames(TravelersAgainstHotels)<- c("TravelerNumber", "HotelNumber")
matrixTravelerAgainstHotels<- asAssociationsMatrix(TravelersAgainstHotels)
matrixToAnalyze<- matrixTravelerAgainstHotels %*% t(matrixTravelerAgainstHotels)
r<- rankDiscovery(x.given=matrixToAnalyze, plotting=FALSE, ttl="Travelers vs Hotels", columns=NULL, model=GaussianModel)
cat(sprintf("Number of estimated clusters: %.0f\n", r$choice))
clustering<- kmeans(x=TravelersAgainstHotels, centers=r$choice, nstart=8, iter.max=100)
cat(sprintf("number of centers %.0f, within sum-of-squares %.1f\n", clustering$ncenters, clustering$tot.withinss))
clusplot(TravelersAgainstHotels, clustering$cluster, color=TRUE, shade=TRUE, labels=2, lines=0,
main=sprintf("Random Generation of Data for K-Means, case %.0f\n(Travelers against Hotels, number of clusters = %.0f)", kGen, r$choice))
# pause.with.message(sprintf("Generation %.0f ...\n", kGen))

# Basically, even with random associations between Travelers and Hotels, analysis will "reveal" a
# statistically significant number of clusters, and, then, if k-means is told that, it will go forward
# and produce clusters that look plausible.



if (FALSE)
prOn(fb="additional", fx="events", basettl="additional events", ttl="")

x<- seq(1,2000, 1)
y<- dweibull(x, shape=1.5, scale=500)*100000
plot(x,y, type="l", lwd=2, lty=1, col="blue", xlab="time (dimensionless)", ylab="additional number of users alerted to event", main="")


About hypergeometric

See http://www.linkedin.com/in/deepdevelopment/ and http://667-per-cm.net
This entry was posted in statistics and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s