Fog Creek Software
Discussion Board




Correct term please

I need some help with the correct term for something.  If you have an application collecting data from multiple sources, you want the application to manipulate the data such that it all uses the same terminology/codes/scale/units.  You do this so you can use the data from different sources together and eventually do some data mining.  My first instinct is to call this "normalization".  I think that's what a scientist would call it but then "normalization" has a more specific meaning within databases.  I am also thinking "standardization" but I'm not sure that captures the whole idea.

Is there an official term for this process in the field of data warehousing/mining?

name withheld out of cowardice
Wednesday, June 09, 2004


Aggregation & Normalization?

my 0.02

KC
Wednesday, June 09, 2004

I don't know if this is the correct term, but we call it data validation. It involves collecting data from multiple sources and validating/fixing it so that it can be fed to the warehouse.

Just say stuff like "Data validation takes place in our staging area" and nobody will object ;-)

Patrik
Wednesday, June 09, 2004

We called it data cleaning or data transformation.

The more creative folks used "data scrubbing."

Lauren B.
Wednesday, June 09, 2004

The process you're describing has a well-known acronym, ETL: Extract, Transform, and Load. It describes the generic process of taking data from a variety of sources, transforming the data so that it all gels well together, and then loading it into a secondary data source.

Data warehousing is one example of a process that generally relies on ETL to fill the warehouse. (There are also some specific requirements for something to be a data warehouse; namely, a dimensional model that works well with data cubes and ad-hoc reporting tools.)
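To make the three ETL steps concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the two sources, their field names (temp_f, celsius), and the readings table are assumptions, not part of any particular ETL tool. It only shows the shape of the idea: pull rows from several sources, map them onto one schema and one unit, and write them to a single target.

```python
import sqlite3

# Hypothetical source records: two feeds reporting the same kind of
# measurement with different field names and different units.
SOURCE_A = [{"id": "a1", "temp_f": 212.0}, {"id": "a2", "temp_f": 32.0}]
SOURCE_B = [{"key": "b1", "celsius": 37.0}]


def extract():
    """Extract: yield raw rows from every source, tagged with their origin."""
    for row in SOURCE_A:
        yield "A", row
    for row in SOURCE_B:
        yield "B", row


def transform(origin, row):
    """Transform: map each source's layout onto one common (id, temp_c) row."""
    if origin == "A":
        # Source A reports Fahrenheit, so convert to Celsius.
        return (row["id"], round((row["temp_f"] - 32.0) * 5.0 / 9.0, 2))
    if origin == "B":
        # Source B already reports Celsius; only the field names differ.
        return (row["key"], round(row["celsius"], 2))
    raise ValueError(f"unknown source: {origin}")


def load(conn, rows):
    """Load: write the unified rows into the target store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (id TEXT PRIMARY KEY, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three steps in order against the given connection."""
    load(conn, [transform(origin, row) for origin, row in extract()])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    print(conn.execute("SELECT id, temp_c FROM readings ORDER BY id").fetchall())
```

In a real pipeline the extract step would read from files, APIs, or other databases, and the transform step is where most of the effort usually goes.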

Brad Wilson (dotnetguy.techieswithcats.com)
Wednesday, June 09, 2004

Homogenize? Standardize? Regularize? Normalize? Marshal?

Not sure if there's any one 'approved' term for it...

Matt
Wednesday, June 09, 2004


Not in the context of data mining, but generally, it is referred to as metrication, or even metrification. I hope that comes close.

Sathyaish Chakravarthy
Wednesday, June 09, 2004

Consolidation?

Derek
Wednesday, June 09, 2004

You'll probably find:

http://www.crisp-dm.org/download.htm

useful; it's a process map for data mining that lays out all of these terms and a process around them too.

Konrad
Wednesday, June 09, 2004

What you are creating is a Data Warehouse to integrate and standardise your data.  Then you need an application to access the DW.

Data warehouses integrate data from different sources: relational db, text files, etc…

Unlike operational DBs - which are designed for on-line transactional processing (they are normalised) so that transactions like update, insert and delete can be performed quickly - DWs are usually denormalised to make querying and report generation easier (you avoid joins in your queries, and so on).

You standardise your data when you move it from the original source to the DW.
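A small sketch of the normalised versus denormalised point, using Python's built-in sqlite3 module. The table and column names (customers, orders, sales_flat) are invented for the example. The same report is produced twice: once from the normalised tables, which needs a join, and once from a flattened copy of the kind a warehouse load might build, which does not.

```python
import sqlite3


def build(conn):
    """Create a small normalised schema plus a denormalised copy of it."""
    conn.executescript(
        """
        -- Operational (normalised) layout: region lives only in customers.
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'North'), (2, 'South');
        INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);

        -- Warehouse-style (denormalised) layout: region is repeated per row.
        CREATE TABLE sales_flat AS
            SELECT o.id AS order_id, c.region AS region, o.amount AS amount
            FROM orders o JOIN customers c ON c.id = o.customer_id;
        """
    )


def report_normalised(conn):
    """Total sales per region from the normalised tables (requires a join)."""
    return conn.execute(
        "SELECT c.region, SUM(o.amount) FROM orders o "
        "JOIN customers c ON c.id = o.customer_id "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()


def report_flat(conn):
    """The same report from the denormalised table (no join needed)."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales_flat "
        "GROUP BY region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build(conn)
    print(report_normalised(conn))
    print(report_flat(conn))
```

The trade-off is that the flat table repeats the region on every row, so it is cheaper to query but more awkward to update, which is why it suits reporting rather than transaction processing.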

Cecilia Loureiro
Wednesday, June 09, 2004

This is orthogonal to the OP but it reminded me of it.

The EC has a term for approving things as being standard, though it requires you to break your throat to say it.

Homologation.

Which is not, as might first seem, a term for a group of gay people deputised as ambassadors to a far away world.

Simon Lucy
Wednesday, June 09, 2004

The OP's first instinct was on the right track:  Enterprise Integration Patterns defines this type of process as a "Normalizer."  Of course, that's completely different than the RDBMS notion of "normalization."

Check Out: http://www.eaipatterns.com/Normalizer.html

I would have guessed aggregator, but evidently that's something that waits around and collects related async messages and sends them out when the whole package is complete.
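For anyone curious what that pattern looks like in code, here is a rough Python sketch: a router that picks a translator per incoming format, with each translator producing the same common structure. The formats, field names (cust_name, total), and the normalize function are hypothetical and only illustrate the idea; they are not taken from the book or the site.

```python
import json
import xml.etree.ElementTree as ET


def from_json(raw):
    """Translate a hypothetical JSON order message into the common format."""
    data = json.loads(raw)
    return {"customer": data["cust_name"], "amount": float(data["total"])}


def from_csv(raw):
    """Translate a hypothetical 'name,amount' line into the common format."""
    name, amount = raw.strip().split(",")
    return {"customer": name, "amount": float(amount)}


def from_xml(raw):
    """Translate a hypothetical <order> document into the common format."""
    root = ET.fromstring(raw)
    return {
        "customer": root.findtext("name"),
        "amount": float(root.findtext("amount")),
    }


# The routing table: one translator per known incoming format.
TRANSLATORS = {"json": from_json, "csv": from_csv, "xml": from_xml}


def normalize(fmt, raw):
    """Route a message to the translator for its format."""
    try:
        translator = TRANSLATORS[fmt]
    except KeyError:
        raise ValueError(f"no translator for format: {fmt}") from None
    return translator(raw)


if __name__ == "__main__":
    print(normalize("json", '{"cust_name": "Ann", "total": "12.5"}'))
    print(normalize("csv", "Ann,12.5"))
    print(normalize("xml", "<order><name>Ann</name><amount>12.5</amount></order>"))
```

Whatever the incoming format, downstream code only ever sees the one common structure, which is the point of the pattern.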

Joe
Wednesday, June 09, 2004

The people on my team sometimes refer to this process as "canonicalization".

Big word. Lots of letters. Makes us look smart.

Benji Smith
Wednesday, June 09, 2004

I'd call it "data fusion", although that's a term more common in military applications for what you're doing.

Tom H
Wednesday, June 09, 2004

Another vote for ETL.  At least that is what it is called at our shop.

Jim L
Wednesday, June 09, 2004
