Fog Creek Software
Discussion Board

How tough is it to duplicate a clustering engine?

Vivisimo does on the fly document clustering of text data.
How tough would it be to develop a replica of something like that, one which does on-the-fly conceptual document clustering?
From a software engineering perspective, how would I create a budget for this?

Saturday, November 01, 2003

The difficulty of implementing a cluster is dependent largely on what kind of clustering you need.  At the simplest level, you can say everything is read-only and then your cluster just becomes a cache.  Once you start updating data (documents), you need to figure out how each node in the cluster is going to get an update of the data.  If you need the update to happen in such a fashion that everyone sees the change within a single transaction, it gets even more fun.

Network/communication failures are the most fun of all.  Let's say you have to update every node in a single transaction.  You start the transaction, update the first two nodes... and then you get "connection refused" when trying to update the third.  What do you do here?  There are numerous ways to solve the problem, but it can be a challenge to pick and implement the appropriate one.
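To make the "third node refuses the connection" case concrete, here is a minimal sketch of one possible answer: undo the writes already made and report failure. Everything in it (the Node class, NodeUnavailable, update_all) is a hypothetical in-memory stand-in rather than any real cluster API, and it assumes the undo step itself never fails, which a real network will not guarantee.

```python
class NodeUnavailable(Exception):
    """Raised when a node cannot be reached (e.g. "connection refused")."""


class Node:
    """Hypothetical in-memory stand-in for one cluster node."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.data = {}

    def write(self, key, value):
        """Store the value and return whatever was there before (None if absent)."""
        if not self.reachable:
            raise NodeUnavailable(self.name)
        previous = self.data.get(key)
        self.data[key] = value
        return previous

    def restore(self, key, previous):
        """Put the old value back; None is treated as "key was absent"."""
        if previous is None:
            self.data.pop(key, None)
        else:
            self.data[key] = previous


def update_all(nodes, key, value):
    """Write to every node, or undo the writes already made and return False."""
    done = []  # (node, previous value) pairs, in the order they were written
    for node in nodes:
        try:
            done.append((node, node.write(key, value)))
        except NodeUnavailable:
            # Assumption: restoring cannot fail. Real systems need a plan
            # for a node that disappears in the middle of the rollback too.
            for written, previous in reversed(done):
                written.restore(key, previous)
            return False
    return True
```

Rolling back is only one of the "numerous ways"; retrying later, or accepting the write once a majority of nodes has it, are others with different trade-offs.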

So.  Before saying "oh yeah, we can do that", think about what kind of clustering you need.  If your needs are simple enough, you can probably get by relatively easily with a homegrown solution.  (But beware of the ever-present feature creep.)  If your needs are more complex, unless you have really sharp people who (ideally) have some experience with and understanding of the problem domain, you're most likely better off using someone else's solution.

Creating a budget is simple.  Figure out the effort necessary (man-hours), and how much it's going to cost to pay the people, buy the software, etc. etc.  And don't forget the opportunity cost either: what would those people be working on if you use an existing solution?  Is that a better value for the company?  Of course here again the devil is in the details.  Every detail you miss adds time to the schedule, so knowing how much you don't know becomes an important part of scheduling -- the more you don't know, the more you pad the schedule.
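As a rough illustration of that arithmetic, the padding idea can be written as a single function. The function name, the parameters, and the notion of expressing "how much you don't know" as one fraction are all assumptions made for this sketch; opportunity cost is deliberately left out because it depends on what else the team could be building.

```python
def estimate_budget(known_hours, hourly_rate, fixed_costs, uncertainty):
    """Return (padded_hours, total_cost).

    known_hours -- effort you can actually account for, in man-hours
    hourly_rate -- cost of one man-hour
    fixed_costs -- software licenses, hardware, and similar one-off items
    uncertainty -- padding fraction; 0.5 means "pad the schedule by 50%"
    """
    padded_hours = known_hours * (1 + uncertainty)
    total_cost = padded_hours * hourly_rate + fixed_costs
    return padded_hours, total_cost
```

With made-up numbers, estimate_budget(1000, 50, 10000, 0.5) gives 1500 padded hours and a total of 85000. The less you know, the larger the uncertainty fraction you plug in.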


Saturday, November 01, 2003

Clustering can mean distributed computing, rather
than machine learning of features, which is what you
mean. Please consider using a more explicit phrase.
The answer to your question is unfortunately "It depends",
because the difficulty is not in producing an algorithm,
but producing one that will work efficiently.
For the algorithm to work efficiently, it has to match the
characteristics of your data. Since I am not familiar
with the text you wish to mine (or the system you wish to
clone), I cannot say.
Also, you may wish to ask on a web board for data mining enthusiasts or natural language programmers; they will know more about this specialty.

Saturday, November 01, 2003

>Clustering can mean distributed computing, rather
>than machine learning of features, which is what you
>mean. Please consider using a more explicit phrase.

Anon stated "document clustering of text data" and mentioned Vivisimo. I don't know how much more explicit he can get. Apparently some people in this thread can't read.

Saturday, November 01, 2003

From a software engineering perspective, you've got to get a good idea of what the requirements are. Then you get someone in the know to estimate what it would take to meet these requirements. Then you go back to the requirements, then back to the person, until you converge on something.

Then you build a simple prototype, that can be _used_ by (not just shown to -- actually _used_ by) your potential users. Then you take the feedback and modify your spec. Possibly continue working on the prototype, until you're sure you know what the real specs are.

And only _then_ do you start to realistically think about a budget for your project.

Clustering problems are hard. Although very general-purpose solutions exist, you often need lots of domain expertise and tweaks to get the expected results for a real-world problem.
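For a feel of what a general-purpose baseline looks like, here is a minimal sketch of single-pass ("leader") clustering over bag-of-words vectors with cosine similarity. It is not how Vivisimo or any particular product works; the function names and the 0.3 threshold are assumptions for illustration, and the threshold is exactly the kind of knob that ends up needing domain-specific tweaking.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(snippets, threshold=0.3):
    """Single pass: join the most similar cluster, or start a new one."""
    clusters = []  # each entry: {"centroid": Counter, "members": [str, ...]}
    for text in snippets:
        vec = vectorize(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [text]})
        else:
            best["centroid"].update(vec)  # summed term counts act as the centroid
            best["members"].append(text)
    return [c["members"] for c in clusters]
```

Because each snippet is looked at once, in arrival order, this runs on the fly, but the result depends on that order and on the threshold. It also has no stemming, stop-word removal, or cluster labeling, which is where much of the real work goes.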

A robust starting point, which is relatively recent, is described in [ ]. I'm a great believer in everything that has an information-theoretic justification.

And, people, the use of the word "cluster" for data mining and data analysis predates the use of that word to describe distributed processing. Unless you feel authoritative, don't correct others. If you do feel authoritative about this, you should give that feeling up.

Ori Berger
Saturday, November 01, 2003

Well, the data is dynamic, gathered from the web by crawling, so I have no known data patterns. How do I adjust for that?

Sunday, November 02, 2003
