Fog Creek Software
Discussion Board

Application Architechture ideas ?

Hi All

I am looking for architechture ideas for the following scenario

We need to import a large (50MB and growing !) file daily.  The file consists of between 10000 - 20000 sections all of which can be treated as an individual file, ie 10000-20000 seperate imports. The import process for each section consists of working through a line at a time, querying the database for various bits of information, and then writing a header and a number of lines to the database.

At present the process is done as a single monolith.  The file is opened at the start of the process, and each section is imported 1 at a time until the end is reached.  If any section causes an error (fairly common !) then the whole process stops, and has to be restarted from the section following the broken section

Current problems

1) Database contention - the import process is run by a VB COM+ components, and we have got it running fast enough that it utilises almost all of the database access whilst it is running

2) Scalability - there isn't any .  Ideally we would like to be able to scale this out over more than one machine !

3) Robustness - would like to be able to report about broken sections, but then carry on with the rest

Any thoughts - I'm not after solutions, just some ideas on how we could proceed.



Thursday, April 4, 2002

Ok... here are my thoughts; feel free to critique and tear apart.

We care most about the part in the middle that takes a (packet, DB connection, error output destination) as input, and outputs the (DB info, error code).  The "packet" can be a single section or group of them.  One possible error code is having too large a packet, if you pass one that's say 30 GB and the machine can't deal.

Connected to this function is an adapter that calls it with the input values.  For example, one adapter may be a Java servlet.  Calling this servlet may be a "salad shooter" that sends these packets to a bunch of machines through http or whatever protocol.  The salad shooter sends out the packets in multiple threads (with some max # of threads) and expects the error output.  Then it deals with the error.

I believe this takes care of your points:
1) db contention - you can upgrade to a distributed db, if writes don't intefere with reads very much
2) scalability - trivial to upgrade or use different technology; all that would be required is writing new adapters
3) robustness - it would all be in the error handling, and there would be code dealing with the server dying or timed out

I hope this doesn't sound naive or obvious, I'm still learning about design, and I think this is practice for me.  ;-)  Mainly I've learned that using adapters and error codes are important parts of design, because they enable anything to scale.

Jeff Spender
Thursday, April 4, 2002

If the sections are independent, I don't see anything in "architecture" that requires you to stop instead of continue after an error ... the change could be as small as putting "continue;" instead of "return;" inside a "for" loop (I'm speaking in C, not VB).

I guess you're saying that processing takes some noticeable fraction of a day, that you fully-utilise some resource of (the "database connection of")one machine, that you're not trying to further optimize the usage of that one machine's resoucrce (I would try to do that first), but instead want to distribute it over more than one machine.

I'm thinking of a pre-processor: take that 50MB file, open it up, re-save it as several files (with fewer sections per file fragment) ... that sounds sounds easy (to code) and quick (to run) ... where by quick I mean e.g. some small number of minutes: the process needn't take much longer than it takes to copy the entire file from one place to another ... my rule of thumb (using C) is that simple processing (parseing in RAM) is infinitely quick compared to device (disk) I/O. To maximize the speed of this preprocessor, don't try to split the file into 20000 subfiles (one section per subfile): because you would add the overhead of creating (opening etc.) each of the 20000 new files; and because probably you won't be distributing these to 20000 separate machines for database I/O. Instead, create as many subfiles as you have machines (2, 3, whatever). Before you implement, run a test on two machines to see how well it does scale when you access the database server from more than one database client machine.

Christopher Wells
Thursday, April 4, 2002

Dump each section into a queue and have a queue server manage the processing.

Thursday, April 4, 2002

50MB is actually quite small. If it isn't going to grow larger than ~1.5GB, map into into memory and process it in-memory - that would absolutely be the fastest way, and it's also very straightforward.

If it is going to grow larger than 2GB, be sure to test what happens BEFORE you get there for real - most probably, somewhere along the processing/generation pass something will break at the 32th bit.

a better, more scalable solution is probably to create the file "broken" into sections a-priori, and then it will be much easier to process each section individually.

Either way, look at the numbers before going overboard with architecture - 50MB is quite small - even 500MB fits into a (now) decent machine's main memory. Assuming the processing is mostly copying things from one place to another place (and perhaps summing or counting things), a 500Mhz machine should be able to process  50MB in under a second. Trying to distribute work sounds like an overkill - it sounds like a 20 line Perl script (If you don't know Perl, now would be a good time to learn) if you can settle for 10 seconds of processing time; If you need it to be as fast as possible, writing it in efficient straightforward C is probably much simpler than trying any distributed architecture.

Ori Berger
Thursday, April 4, 2002

Ori, I think you're right. One extra point, with SQL you can get ("SELECT") 20000 records in a single statement or database transaction, however you must INSERT them one at a time ... so inserting can take a long time.

Christopher Wells
Thursday, April 4, 2002

Christopher Wells wrote:
"with SQL you can get ("SELECT") 20000 records in a single statement or database transaction, however you must INSERT them one at a time ... so inserting can take a long time."

Not if you put the inserts into a transaction.

Jan Derk

Jan Derk
Thursday, April 4, 2002

On most databases, _even_ if you put it into a transaction - do consider that on many databases, everything is in a transaction whether you ask for it or not.

Most databases also support some sort of a "bulk insert". If you're using MySQL, you do that using standard insert ("insert into yourtable(column1,column2) values(val1,val2),values(val3,val4), ...." or something to that effect). Oracle and MSSQL has a "load file" or "load bulk". I've been able to insert tens of thousands of records/second on a properly configured low end machine running MSSQL, and a few thousands on Oracle and MSSQL.

That is another important point to keep in mind - grouping all database insertions into one lump of "bulk insert" has a potential of providing orders of magnitude increased performance.

Another type of solution which is sometimes better than anything else (depending on complexity of processing, amount of data and database latency) would be to load the 50MB file into a temporary database table, and do "insert into table1(col1, col2, ...) select complex_func(...) as col1, complex_func2(...) as col2 where ....", using subselects where appropriate. Doing everything inside the databases removes all latencies, and the query optimizer - while not perfect - usually knows what it is doing (and you can always hint it to go the right way, although that is non portable).

If you haven't yet done so, read Phil Greenspun's critique of application servers; It's extremely relevant here [ ]

Ori Berger
Thursday, April 4, 2002

Matt wrote:
"If any section causes an error (fairly common !) then the whole process stops, and has to be restarted from the section following the broken section"

Seems like a simple exception handler would solve this.

For the other questions you need to provide a bit more information. Like
what kind of database are you using? (local, client/server, MYSQL, MSSQL), how large do you expect it to grow, how many concurrent users, how long does importing take, what language do you use, etc.

About the scalibility: what do you want to scale over multiple computers: the file importing? (does it take too long?) the database (is it unreachable at times?). If you are talking about the latter you might want to check out a client server solution if you don't already use one.

Jan Derk
Thursday, April 4, 2002

Matt, a little more info on the db you're using would be very helpful to suggestions. 50MB is a small file compared to some of what I've seen done lately. We import and process 20 million rows (several GB) across the course of a  day and process it (quite extensively) in about 1 hour at night, on a PII-450/256MB with SS2K. In the past we've handled quite a bit more than that 20m.

Give us some extra info about what software you're using for the db. Chances are if you're using a server db like Oracle or SQL Server, you can do 100% of the processing on the server itself after a bulk insert. If you're doing a lot of back-and-forth processing of the data over the network, that's probably what's taking so long.

In general, get the data into the db as quickly as possible, then do the processing on the db itself.

Troy King
Friday, April 5, 2002

Jan, you said : Not if you put the inserts into a transaction.

On every RDBMS I've worked with, wrapping an explicit transaction around inserts doesn't give you all-at-once inserts, it gives you all-or-none inserts. In fact, transaction-wrapping can make the insert take longer. I can't think of a single instance where a transaction increases the speed of *any* operation (other than simplifying rollbacks). The only all-at-once nature of a transactional insert is that it will become visible all at once when the commit is reached (though it was physically being placed there, one at a time, just like a non-transactional insert, up until that point).

In most databases, transaction-wrapped inserts are happening in the table as well as the log, and for really big inserts, this can take longer because that portion of the log won't be checkpointed until the commit is reached, thus increasing the size (and decreasing the speed) of the active portion of the log. There are rare types of inserts that will be, for all intents and purposes, inserted at once, but those are the result of certain types of updates, not natural inserts.

Don't mistake transactions for speed enhancements. They're reliability enhancements.

Troy King
Friday, April 5, 2002

Troy, you are right in correcting me. What I said works for the database (DBISAM, ) I generally work with. In DBISAM transactions speeds up inserting a lot. And yes it can even take longer if there are a huge amount of inserts. That's why one has to commit every few thousand records or so.

I should have been much more general. Ori said it best: most databases support a way to do fast SQL bulk inserts.

Jan Derk
Friday, April 5, 2002

*  Recent Topics

*  Fog Creek Home