Fog Creek Software
Discussion Board

Data Mining Project Suggestions

I am an undergraduate student and I would like to do a three-month project on data mining, specifically text classifiers. The project should solve some real-life problem - I am *not* looking for a theoretical one. One very good example of text classifiers is Google News - they classify news into categories.

I would like some suggestions for the possible projects please. I understand that 3 months is a pretty short period, but still, there must be problems out there worth a try.

I have decent programming skills and won't mind programming-oriented projects.
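(For concreteness, here is a minimal sketch of what such a classifier does under the hood - a bag-of-words naive Bayes, with categories and training sentences made up purely for illustration:)

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count word frequencies per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    cat_counts = Counter()               # category -> number of docs
    for text, cat in labeled_docs:
        cat_counts[cat] += 1
        word_counts[cat].update(text.lower().split())
    return word_counts, cat_counts

def classify(text, word_counts, cat_counts):
    """Pick the category maximizing log P(cat) + sum of log P(word|cat)."""
    vocab = {w for c in word_counts for w in word_counts[c]}
    total_docs = sum(cat_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total_docs)
        cat_total = sum(word_counts[cat].values())
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((word_counts[cat][word] + 1) /
                              (cat_total + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

docs = [
    ("shares fell on weak earnings report", "business"),
    ("quarterly profit beats market forecast", "business"),
    ("team wins championship final match", "sports"),
    ("striker scores twice in cup match", "sports"),
]
word_counts, cat_counts = train(docs)
print(classify("profit and earnings rise", word_counts, cat_counts))
```

Real systems (Google News included, presumably) use far richer features and much larger training sets, but the shape of the problem is the same.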

Future Data Miner
Friday, March 19, 2004

I was thinking about this the other day: why not use text mining on source code? There are a lot of things to look for, and there are large open source code bases to test it on.

I'm sure there are things to find - maybe a way to detect potential vulnerabilities (like buffer overflows), classify the use of comments, etc.
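(A trivial starting point for the vulnerability angle: scan C source for calls to functions that are classic buffer-overflow suspects. The function list here is just a common shortlist, nowhere near a real static analyzer:)

```python
import re

# Classic unchecked-buffer C functions; a real tool would go far beyond this.
RISKY_CALLS = ("strcpy", "strcat", "sprintf", "gets", "scanf")

def flag_risky_lines(source):
    """Return (line_number, line) pairs containing a risky call."""
    pattern = re.compile(r"\b(%s)\s*\(" % "|".join(RISKY_CALLS))
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((lineno, line.strip()))
    return hits

code = """\
#include <string.h>
void greet(char *name) {
    char buf[16];
    strcpy(buf, name);   /* no bounds check */
}
"""
print(flag_risky_lines(code))
```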

Friday, March 19, 2004

How about on Fog Creek discussion groups?

Friday, March 19, 2004

> How about on Fog Creek discussion groups?

Did you mean classification of threads into various categories (such as "technical help", "discussion on Joel's articles", "software processes", etc.)?

Future Data Miner
Friday, March 19, 2004

How about grouping google results (or any search engine results for that matter) in meaningful categories?

I mean, if we search for "apple", there might be results related to the company Apple, the fruit apple, the rock band Apple, etc.

This seems to be a hot topic these days and many companies are building search engines based on this idea. Google even provides APIs for using their search engine.
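(A crude sketch of the idea: cluster result snippets greedily by word overlap (Jaccard similarity). The snippets are invented for the "apple" example; real systems use far better features than raw word overlap:)

```python
def jaccard(a, b):
    """Word-overlap similarity between two snippets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(snippets, threshold=0.15):
    """Greedy single-pass clustering: join a snippet to the first
    cluster whose seed it overlaps enough, else start a new cluster."""
    clusters = []
    for s in snippets:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

results = [
    "Apple Computer releases new PowerBook laptop computer",
    "Apple Computer quarterly earnings laptop sales up",
    "apple pie recipe with fresh apple fruit",
    "growing apple fruit trees in your garden",
]
print(cluster(results))
```

With these four snippets, the company results end up in one cluster and the fruit results in another.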

Friday, March 19, 2004

One word: resumes :)

Friday, March 19, 2004

Medical results.

A lot of data is granular enough that this isn't a problem, but results like radiology reports and mammograms are very hard to extract data from, at least for us, since the person reading the film basically dictates a paragraph or two of text.  The only quantifiable result we currently get is normal or abnormal.

It would also have the added benefit of being sellable if you get it to work.
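(A naive first pass at the dictated-report problem might just look for finding terms that aren't preceded by a negation. Everything here - the terms, the sample reports, the 20-character negation window - is invented for illustration and is nowhere near clinical grade:)

```python
import re

# Toy list of finding terms; a real system would need a medical lexicon.
FINDING_TERMS = ("mass", "lesion", "nodule", "fracture", "opacity")

def classify_report(text):
    """Label a dictated report 'abnormal' if any finding term appears
    without a negation word shortly before it, else 'normal'."""
    text = text.lower()
    for term in FINDING_TERMS:
        for match in re.finditer(r"\b%s\b" % term, text):
            window = text[max(0, match.start() - 20):match.start()]
            if not re.search(r"\b(no|without|negative for)\b[^.]*$", window):
                return "abnormal"
    return "normal"

print(classify_report("The lungs are clear. No mass or nodule is seen."))
print(classify_report("There is a 2 cm nodule in the right upper lobe."))
```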

Steve Barbour
Friday, March 19, 2004

Look for any problem that has a practical training corpus available to you. The last thing you want to do is spend eons of time hand-classifying a substantial body of documents just to check the validity of your algorithms.

If you are going to do subject classification of news items, going for the IPTC subject code topics is the "pro" way, but you might need to beg some news agencies for a corpus.

Just me (Sir to you)
Friday, March 19, 2004

A potential project in this field is classifying messages by author. This is a topic of interest in law enforcement, because it would allow them to compare a message of unknown authorship against a corpus of messages whose authorship is known.

There is a theory that everyone has distinctive patterns in their written communications - they tend to use certain words, phrases, and punctuation. The main problem with this, though, is that the evidence from the text alone tends to be a bit weak.
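(Those patterns can be turned into simple numeric features per author. A bare-bones sketch - the feature set here is just a guess at what might discriminate, not an established one:)

```python
def style_features(text):
    """Crude stylometric fingerprint: average sentence length,
    punctuation rates, and function-word frequencies."""
    words = text.split()
    n = max(len(words), 1)
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    function_words = ("the", "of", "and", "to", "in", "that", "it")
    feats = {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "comma_rate": text.count(",") / n,
        "exclaim_rate": text.count("!") / n,
    }
    for w in function_words:
        # Rate of each common function word, stripped of punctuation
        feats["fw_" + w] = sum(1 for t in words
                               if t.lower().strip(",.!?") == w) / n
    return feats

a = style_features("Well, that is it, then! The deal is done, finally!")
b = style_features("The committee reviewed the proposal and approved it.")
print(a["exclaim_rate"], b["exclaim_rate"])
```

Comparing an unknown message then reduces to measuring distance between its feature vector and each known author's average vector.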

A corpus for this particular problem is available in the Usenet group rec.skiing.alpine, which has been consumed by a huge flame war since 1997, with many regulars of the group posting under pseudonyms as well as their regular names. The flame war has involved court orders, firings, stalkings, death threats and more. It's pathetic in the extreme.

Another classification project is, of course, spam. For example, there is a need for a classifier for blog comments (and for mail) that can identify messages that convey no meaningful information. It is common for spammers and trojan writers to use messages and subject lines so vague they could mean anything. One might not be able to classify messages definitively by this alone, but it should be possible to formulate some rule that says "this message has no meaningful content" (e.g. it says something vague like "see the attached file for information").
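(The "no meaningful content" rule could start as crudely as checking whether a message is little more than a vague stock phrase. The phrase list and the six-word cutoff are purely illustrative:)

```python
# Illustrative stock phrases; a real filter would learn these from data.
VAGUE_PHRASES = (
    "see the attached file", "check this out", "for details see",
    "here is the document", "as you requested",
)

def looks_contentless(message, max_extra_words=6):
    """Flag a message that is little more than a vague stock phrase."""
    lowered = message.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            leftover = lowered.replace(phrase, "")
            # Almost nothing left besides the stock phrase? Suspicious.
            if len(leftover.split()) <= max_extra_words:
                return True
    return False

print(looks_contentless("Please see the attached file for information."))
print(looks_contentless(
    "See the attached file for the Q3 revenue breakdown we discussed "
    "at Tuesday's meeting; page 4 has the regional numbers."))
```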

Friday, March 19, 2004

I was thinking about classifying UNIX man pages.
Those are often unrelated, but linked through references to other pages.

They come under the heading of

I guess that is a simple job of clustering stuff, so I will try to do it now ;-)
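(Since man pages cross-reference each other in their SEE ALSO sections, a first cut could treat those links as a graph and take connected components as clusters. The pages and links below are made-up toy data, not parsed from real pages:)

```python
from collections import defaultdict

def connected_components(edges):
    """Cluster pages that reference each other, directly or transitively."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:               # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Toy SEE ALSO links; real data would be parsed out of the pages themselves.
links = [("tar", "gzip"), ("gzip", "zcat"),
         ("ls", "stat"), ("stat", "chmod")]
print(connected_components(links))
```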

Michael Moser
Saturday, March 20, 2004

Thanks guys!

I will do a bit of research on these ideas and get something going.

Thanks again !!

Future Data Miner
Monday, March 22, 2004

Take a look at the daily press releases and articles for a given stock (e.g. AMZN).

Analyze those articles and compare them to the price action, to try to judge the general sentiment around the equity.
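(A bare-bones version of that comparison: score each day's headline with a tiny word list and see how often the sentiment sign matches the price-move sign. The word lists, headlines, and price moves are all fabricated for illustration:)

```python
# Toy sentiment lexicons; real work would use a proper financial lexicon.
POSITIVE = {"beats", "surge", "record", "upgrade", "growth"}
NEGATIVE = {"misses", "lawsuit", "recall", "downgrade", "loss"}

def sentiment(headline):
    """Positive-minus-negative word count, normalized by length."""
    words = headline.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return score / len(words)

def sign_agreement(headlines, price_moves):
    """Fraction of days where sentiment sign matches the price-move sign."""
    hits = 0
    for h, move in zip(headlines, price_moves):
        s = sentiment(h)
        if (s > 0 and move > 0) or (s < 0 and move < 0) or (s == 0 == move):
            hits += 1
    return hits / len(price_moves)

headlines = [
    "AMZN beats estimates, revenue growth accelerates",
    "AMZN faces lawsuit over patent claims",
    "AMZN announces record holiday sales",
]
moves = [+1.2, -0.8, +2.1]   # made-up daily % changes
print(sign_agreement(headlines, moves))
```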

Thinking Hard.
Monday, March 22, 2004
