Search engine results clustering: when to stop
I am using an agglomerative clustering algorithm to cluster the output of a search engine. How do I know where to put the threshold so that I get optimal, or close to optimal, clusters?
Sumit
Thursday, June 19, 2003
I was looking at a similar abstract problem: how many clusters to use. I did find a paper on the web that used Bayesian trickery to work out how many clusters, but it's under a heap somewhere.
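I think the answer is to use a validity index such as Dunn's or Davies-Bouldin. To oversimplify, these measure how tightly each cluster is lumped together and how distinct the clusters are from one another. You would compute the index for each candidate threshold and keep the cut that scores best (highest for Dunn's, lowest for Davies-Bouldin).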
There's a paper that explores these called 'Performance Evaluation of Some Clustering Algorithms and Validity Indices'. Google for it and you can find a PDF for free. IEEE have it on their site, but you have to pay, I believe.
As far as I know (which is only a bit), software such as SAS won't evaluate your clusters this way so you may need to do some homebrew stuff for this. Could be OK, I don't know what environment you are working in.
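For what it's worth, the homebrew version does not have to be big. Here is a minimal sketch in plain Python, assuming your search results can be represented as numeric feature vectors compared by Euclidean distance (a real search engine would more likely use something like cosine distance on term vectors). The function names `dunn_index`, `agglomerate` and `best_partition` are made up for illustration, the merging is a naive single-linkage loop rather than whatever your algorithm does, and treating a zero-diameter partition as score 0 is my own choice. It scores the partition after every merge with Dunn's index and keeps the best one.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Dunn's index: smallest gap between two clusters divided by the
    largest cluster diameter. Higher is better. Needs >= 2 clusters."""
    min_gap = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # Assumption: a partition with zero diameter (all singletons or
    # duplicates) is treated as uninformative and scored 0.
    return 0.0 if max_diameter == 0 else min_gap / max_diameter


def single_link(c1, c2):
    """Single-linkage distance: closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerate(points):
    """Naive single-linkage merging. Yields a copy of the partition after
    each merge, from len(points) - 1 clusters down to 2 clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > 2:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
        yield [list(c) for c in clusters]


def best_partition(points):
    """Pick the merge level with the highest Dunn index.
    Assumes at least 3 points, so there is at least one level to score."""
    return max(agglomerate(points), key=dunn_index)


if __name__ == "__main__":
    # Three small, well-separated blobs (made-up data).
    pts = [(0, 0), (0, 1), (1, 0),
           (10, 10), (10, 11), (11, 10),
           (20, 0), (20, 1), (21, 0)]
    for cluster in best_partition(pts):
        print(sorted(cluster))
```

This brute-force version recomputes all pairwise distances at every level, so it is only sensible for a few hundred results at most. For anything bigger you would cache the distance matrix, or score only a handful of candidate thresholds instead of every merge.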
Konrad
Thursday, June 19, 2003
