Search engine results clustering: when to stop?

I am using agglomerative clustering to cluster the output of a search engine. How do I know where to put the threshold so that I get optimal, or close to optimal, clusters?

Thursday, June 19, 2003

I was looking at a similar abstract problem: how many clusters to use.  I did find a paper on the web that used Bayesian trickery to work out how many clusters there should be, but it's under a heap somewhere.

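I think the answer is to use a validity index such as Dunn's or Davies-Bouldin.  To oversimplify, these measure how compact each cluster is and how well separated the clusters are from one another.  You compute the index at each candidate threshold and keep the cut that scores best (highest for Dunn's, lowest for Davies-Bouldin).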
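To make that concrete, here is a rough sketch of the idea, not a definitive implementation. It assumes toy 2-D points with Euclidean distance and a naive average-linkage merge loop; real search results would more likely be term vectors with a cosine-style distance. All the function names (`agglomerate`, `davies_bouldin`, `dunn`) are made up for illustration. It scores the partition after every merge, which amounts to trying every possible threshold, and keeps the best-scoring one.

```python
import math
from itertools import combinations

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def davies_bouldin(clusters):
    """Lower is better. Assumes >= 2 clusters with distinct centroids."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its cluster centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

def dunn(clusters):
    """Higher is better: smallest gap between clusters / largest diameter."""
    min_sep = min(math.dist(p, q)
                  for a, b in combinations(clusters, 2) for p in a for q in b)
    max_diam = max((math.dist(p, q)
                    for c in clusters for p, q in combinations(c, 2)),
                   default=0.0)
    return min_sep / max_diam if max_diam > 0 else math.inf

def average_linkage(a, b):
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points):
    """Yield the partition after each merge, from n-1 clusters down to 2."""
    clusters = [[p] for p in points]
    while len(clusters) > 2:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        yield clusters

# Toy data (an assumption for illustration): three tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0),
          (20, 0), (20, 1), (21, 0),
          (0, 20), (0, 21), (1, 20)]

partitions = list(agglomerate(points))
best_db = min(partitions, key=davies_bouldin)   # lowest Davies-Bouldin
best_dunn = max(partitions, key=dunn)           # highest Dunn
print(len(best_db), len(best_dunn))
```

On data this clean both indices agree on where to cut. On messy data they can disagree, and the naive merge loop here is far too slow for anything but small result sets, so treat it as a way to see the mechanics rather than something to deploy.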

There's a paper that explores these called 'Performance Evaluation of Some Clustering Algorithms and Validity Indices'.  Google for it and you can find a free PDF.  IEEE have it on their site, but I believe you have to pay.

As far as I know (which is only a bit), software such as SAS won't evaluate your clusters this way, so you may need to write some homebrew code for it.  That could be fine; I don't know what environment you are working in.

Thursday, June 19, 2003

