data mining

Tell me: what is the best software for data mining?

Tuesday, July 22, 2003

Your brain...

It really depends on what you mean by "data mining".  It has become such a buzzword that it's impossible to tell what someone really means when they say it.

What exactly are you trying to do? How often? On how much data? What kind of data? How much automation do you need? Do you have someone that's prepared to do this more or less full time? How much money have you got?

Sorry, I'm crotchety. I'm bogged down doing "data mining" right now (we used to call it ad hoc reports, but calling it data mining means I get paid more).

Steve Barbour
Tuesday, July 22, 2003

I've got a file (Statistica, but never mind, I can convert it to other formats) with 300,000 cases of customer buying behaviour, and I'd like to identify patterns in that behaviour and what the behaviour depends on...

Tuesday, July 22, 2003

You can really spend what you want.

If you are talking about 'patterns' do you mean the following?

1. Association rules/sequences - identifying combinations of products that usually sell together (the typical beer and nappies story), or that sell in a certain order.

2. Predictive models - given known purchase histories, predicting how likely a group is to make a purchase, or, given past churning behaviour, whether a customer who hasn't yet churned is likely to do so within a certain period of time.
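To make the first kind of pattern concrete, here is a minimal sketch of pairwise association rules in plain Python. The baskets are made-up toy data, and the function name `pair_rules` and the `min_support` threshold are hypothetical choices for illustration, not anything taken from a particular tool. It only looks at pairs of items; a real toolkit would handle larger item sets and far more data.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]       # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules.append((x, y, support, confidence, lift))
    # Highest lift first; lift above 1 means the items co-occur more than chance.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    ["beer", "nappies", "crisps"],
    ["beer", "nappies"],
    ["milk", "bread"],
    ["beer", "nappies", "milk"],
    ["bread", "crisps"],
]

rules = pair_rules(baskets, min_support=0.4)
for x, y, support, confidence, lift in rules:
    print(f"{x} -> {y}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Support tells you how common the combination is, confidence how often the second item shows up given the first, and lift whether that is more than you would expect by chance. With only a handful of baskets these numbers mean very little, which is rather the point of the sample size warning further down.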

Your budget can go from $0 to $1m with this one.  If you've got 300,000 cases you haven't really got a lot of data, so unless you are going to encounter loads of new data and lots of new scenarios, I'd recommend spending as little as you can get away with.

The $0 budget effort I'd recommend would be Weka, a GPL Java toolkit for data mining.

A mid-price effort I'd recommend would be SPSS, which is very widely used. As I remember, it can go from about 10,000 UKP to a couple of hundred thousand UKP, depending on your installation.

A high-ticket item I'd recommend would be SAS Enterprise Miner (conservatively, 100K UKP to 600K UKP).

Alternatively, I hear the latest version of SQL Server has some of the right stuff built in (say, logistic regression, association rule induction, decision trees).

Bear in mind that with all of these you'll need to know exactly what you want, and how to get it.  It's rare to get a tool to hold your hand, and without experience it's easy to mess up what you are doing and get misleading results.  Depending on how many products you've got, your data set might not provide a large enough sample size. For instance, if you've got 3000 separate products and 1000 customers, then depending on their behaviour it could well be difficult to get any reliable predictions or usable associations out of it.
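A rough back-of-the-envelope version of that sample size worry, as a Python sketch. The 3000 products and 1000 customers are the example figures above; the 10 distinct products per customer is purely an assumption picked for illustration.

```python
from math import comb

products = 3000
customers = 1000
items_per_customer = 10  # hypothetical: distinct products bought per customer

# Number of distinct product pairs an association could be found between.
possible_pairs = comb(products, 2)

# Pair observations in the data if every customer bought exactly
# items_per_customer distinct products.
observed_pairs = customers * comb(items_per_customer, 2)

avg_observations_per_pair = observed_pairs / possible_pairs

print(f"possible product pairs:   {possible_pairs:,}")
print(f"pair observations:        {observed_pairs:,}")
print(f"average per possible pair: {avg_observations_per_pair:.4f}")
```

Under those assumptions there are roughly a hundred possible pairs for every pair observation, so most combinations never appear at all and the few that do may appear only once or twice. Real purchasing is concentrated on popular products, so the picture is usually less bleak for the best sellers, but the long tail stays too thin to say much about.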

Tuesday, July 22, 2003

Check out DI-Diver from Dimensional Insight.

Tuesday, July 22, 2003

