Fog Creek Software
Discussion Board




MS Excel spreadsheet -> RDBMS automatic conversion

I don't know if it was discussed before here, tried the search, but it didn't return any good results.

Problem is ordinary - we have built a new in-house business critical system, which covers that was previously done in Excel for years, so there are several million dollars of information worth left in spreadsheets and, of course, we'd love to convert it into new RDBMS application.

Now, the conversion itself will require some intelligent code to cope with all little differences between thousands of spreadsheets - manual entry is not an option (at least yet :-), before we prove its only option).

Usually, we would do: manual Save As -> CSV format, then have some code to load it. This time due to the scale of conversion this technique is not most appropriate.

Please bomb me with questions and ideas (no constraints as for now), everything will do.

Vlad Gudim
Tuesday, April 06, 2004

Automation?
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_wrcore/html/wroriautomatingexcelusingexcelobjectmodel.asp

Just me (Sir to you)
Tuesday, April 06, 2004

Just some random thoughts...

Do your spreadsheets have any commonality - column headers etc?

Knowing the target database would be quite handy. Is it SQL Server? Can you use DMO to create DTS packages, or better yet load the SS into an object and parse it directly.

Bear in mind that the best/cheapest solution may still be to get a load of temps to rekey it. Or slightly more expensive temps to analyse the SS and apply column headers or import using DTS.

Does Excel support an easy way to save as MS Access table (as a halfway house)?

Justin
Tuesday, April 06, 2004

I'm looking on automation already (via C# + .NET) and it seems to be likely candidate, although conversion mustn't necessarily be very fast I'm still concerned with speed in this case.


Our spreadsheets are supposed to have commonality, sure they look alike, but often the same piece of data is in B5 or D5 - the easiness of changing spreadsheets was too tempting for our engineers to always stick to the standard. Sometimes you get labels for  data in adjoining cells, no columns.

DB is Progress v9, but its accessible via ODBC.

Data in Excel not exactly in the "table format" its more like RDBMS laid plain across several spreadsheets - our engineers, God bless them, are well educated and very creative professionals.

Could you give me a bit more clue about "getting a load of temps to rekeying it", please?

Vlad Gudim
Tuesday, April 06, 2004

He means contact a local recruitment agency who will hire you a number of temporary workers to manually copy the data from one place to another. If that kind of thing doesn't exist where you are you could always offshore it ;-)


Tuesday, April 06, 2004

I often find importing and exporting things from Office is easier if you go through one or two intermediate stages.

See what happens after you import the stuff into Access. Then try and running the table splitter to do some normalization. At the very least you'll get some idea of the possible difficutlties.

Stephen Jones
Tuesday, April 06, 2004

I have successfully used access to automate the import of relatively unstandardized Excel spreadsheets into Oracle tables.

Access can interpret the first row of spreadsheet data as column names when importing data. The order of the columns is unimportant.  So the manual effort can often be restricted to makeing sure that column names are correct.

One problem with importing spreadsheets is that a sequence of digits may be interpreted as text or as a number (or date). I find it easier to import into a table where every column is a text column and then convert number/date columns while running various scripts to clean up the imported data.

Rich Fuchs
Tuesday, April 06, 2004

Vlad,

>> I'm looking on automation already (via C# + .NET) and it seems to be likely candidate, although conversion mustn't necessarily be very fast I'm still concerned with speed in this case. <<

Define the start (top left) and end (bottom right) of the Excel range, then read the whole lot in one go - worked mucho fast enough for me.

Object[,] importData;
Excel.Range excelRange = excelWorksheet.get_Range
                                (rangeCellStart, rangeCellFinish);
importData = (System.Object[,])excelRange.Value;
Int32 maxRows = importData.GetUpperBound(0);
Int32 maxCols = importData.GetUpperBound(1);

HTH,

Mark
----
Author of "Comprehensive VB .NET Debugging"
http://www.apress.com/book/bookDisplay.html?bID=128

Mark Pearce
Tuesday, April 06, 2004

An alternative to Office automation is using something like perl's Spreadsheet::ParseExcel module to slurp out the data in whatever format you want. You could write results directly into the db or into whatever intermediate format you want to use for cleanup.

John C.
Tuesday, April 06, 2004

Vlad,

If this were my project, I would:

1. Examine the spreadsheets enough that you can arrive at a "standard" template for each type of table stored in a spreadsheet.

2. Hire an intelligent temporary employee to come in and standardize each Excel spreadsheet so that all of the files conform to one of the templates.

3. Import after that should be relatively easy - there are lots of tools available to convert Excel data to different formats, and then into the format you need.

HTH,

Karl Perry
Tuesday, April 06, 2004

Thank you all for your thoughts and help. I've done large scale data conversion myself couple of times before and I know what a pain it is.

The best tactic, which worked for me so far is to do 70% automatic and 30% manually controlled data transfer.

My spreadsheets are mixture of "label on the right - value on the left", "label on the top - value on bottom" and tables of values. Many worksheets relate to others, some workbooks have external links. No reliable standard for a position of a certain piece of data, although it usually can be located in some range.

There are more than 2000 spreadsheets, getting someone to re-arrange the data is not likely to be an option - chances are they'll mess the data up moving it around.

Probably, I'll identify separate pieces of data along with labels and the approximate cell range they can be in, then search for the label and try to locate respective value. The further analysis will be made based on a format of the value itself. On this stage my code may go to one of the live production DBs and try to figure out if the value matches with any of existent physical entities of a kind.

Once data is recognised and sorted I'll show all values "still in question" to a user along with the spreadsheet letting him or her to make an intelligent correction and, wooo-oo-oops, then it goes to DB.

Vlad Gudim
Wednesday, April 07, 2004

*  Recent Topics

*  Fog Creek Home