Fog Creek Software
Discussion Board




Big Table Discussion

Hi guys,

I have a friend who has a SQL Server database that loads call records. The Call table receives call data. It has an integer auto-incrementing id column.

All call data gets put into one Call table.
Repeat: there are not multiple tables, one for each month, etc.; all data goes into just the one table.

Obviously, this solution does not scale infinitely and performance would degrade as the table grew and grew.

But... what will actually happen when the 2-billionth call gets added? When the auto-incremented id ticks over, will it generate an overflow error (assuming there's plenty of disk space, etc.)?

What experiences have others got with big tables in databases?

Common practice is to divide data into multiple tables, one per month/year etc. Does anyone else have any novel approaches?

Would people demand this forward planning when thinking about big enterprise transaction systems etc.?

Tim H
Tuesday, January 06, 2004

Wouldn't it be roughly the 4 billionth call? 

Now I'm not sure whether you mean call as in a phone call, or call as in a function call, but I'll proceed with "phone calls", although much is equally relevant to function calls.

And why not segregate tables based on the data itself? I have no idea what you have available, but something like area code, or if your data is lopsided in that respect, a couple hundred tables based on the prefix (a fairly random value).

If you're worried about receiving 4 billion phone calls, then you might want to contact the phone company, because they aren't all going to get through.  That's roughly 136 calls a second, 24 hours a day, 365 days a year (assuming you receive them all within a year).  At the end of the year, just archive the data and clear out the table.

And what about 64-bit RDBMSs?  If you receive a mere 37 calls per second, you won't run out of numbers until the universe has roughly doubled in age.
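
For the curious, the arithmetic above can be checked in any query window. A throwaway sketch; it assumes a signed 32-bit int for the 2-billion figure and an unsigned interpretation for the 4-billion figure:

```sql
-- Back-of-envelope checks for the figures above.
-- 31,536,000 = seconds in a 365-day year.
SELECT
    2147483647 / 31536000                       AS years_at_one_call_per_sec,      -- ~68 years to exhaust a signed int
    CAST(4294967296 AS BIGINT) / 31536000       AS calls_per_sec_to_fill_in_a_year; -- ~136 calls/sec for the unsigned range
```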

Elephant
Tuesday, January 06, 2004

You would use a bigint type to store numbers in the range from -2^63 (-9,223,372,036,854,775,808) through 2^63-1 (9,223,372,036,854,775,807). Storage size is 8 bytes.

That should be plenty.

Just me (Sir to you)
Tuesday, January 06, 2004

If you're really concerned, take a look at the Daytona database system developed by AT&T Bell Labs:

http://www.research.att.com/projects/daytona

To quote from the spec page:

"... Daytona is managing a 94 terabyte, 7x24 production data warehouse whose largest table contains over 207 billion rows as of Nov 2003." That should be big enough for any reasonable number of calls that anyone short of the largest telephone company in the world is going to have to handle.

I looked at it en passant once for a large-scale, 4-D seismic datastore. What seems to make it fast is that they don't have the database do anything that the OS should be doing (like record locking, etc.), and that they precompile queries from the Cymbal language (a superset of SQL) into ANSI 'C', then compile and run a native executable.

I think I saw somewhere that there's work going on to create XML output directly from Cymbal queries. There may be more on their website, I haven't looked very recently.

David Roper
Tuesday, January 06, 2004

One solution is to have something like two databases: a production one for online, near-immediate access, and one for less stringent needs like data analysis and reporting. These need to be synced up, say, on a daily basis.

Periodically purge the online database of old records; how old can only be decided after looking at the usage patterns.  This way access to the online database will always remain fast. If you maintain the same schema across both databases, it is trivial to write code which will work on either.

Ask those who want to do reports or complicated data analysis to do it on the "offline" database. By "offline" I don't mean that it is disconnected from the net or anything; it's just terminology to distinguish near-real-time access from the rest.

Code Monkey
Tuesday, January 06, 2004

If you're worried about running out of unique integers, consider a different primary key.  Possibly a timestamp and a phone line as the primary key.  That's guaranteed to be unique, and it's impossible to run out of keys.
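
A sketch of that idea in T-SQL; the table and column names are made up for illustration:

```sql
-- Composite natural key: no counter to exhaust, and a given line
-- can't carry two calls starting at the same instant.
CREATE TABLE Call (
    call_start   DATETIME     NOT NULL,
    phone_line   VARCHAR(15)  NOT NULL,
    duration_sec INT          NOT NULL,
    CONSTRAINT PK_Call PRIMARY KEY (call_start, phone_line)
);
```

One caveat: DATETIME in SQL Server only resolves to about 3 milliseconds, so two calls starting nearly simultaneously on the same line could still collide.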

Clay Dowling
Tuesday, January 06, 2004

Unless Tim H is working at Primus or one of the various baby Bells, he is not going to place an order for a Bell Labs database worth a quarter of a million dollars in software. And even so... Anyway, if you were really bent on doing OLAP+OLTP you might look at a near-free solution like Mnesia (though, like most telecom databases, they rarely resemble anything like SQL Server except for a few SQL emulation modes).

Before you even reach for bigint, you might want to seriously consider partitioning your data into logical series (along months sounds like a good idea) and then connecting them together into a virtual table.
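
In SQL Server that "virtual table" can be a partitioned view: one physical table per month, each with a CHECK constraint on its date range, glued together with UNION ALL. A rough sketch for querying, with invented names:

```sql
-- One member table per month, each constrained to its range.
CREATE TABLE Call_2004_01 (
    call_id    BIGINT   NOT NULL PRIMARY KEY,
    call_start DATETIME NOT NULL
        CHECK (call_start >= '20040101' AND call_start < '20040201')
);
-- ... Call_2004_02, Call_2004_03, etc., defined the same way ...

CREATE VIEW AllCalls AS
    SELECT * FROM Call_2004_01
    UNION ALL
    SELECT * FROM Call_2004_02;
```

With the CHECK constraints in place, the optimizer can skip member tables whose range cannot match a query's date predicate.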

If you do a star schema structure and know a bit about OLAP you should be able to get away with a fairly small db farm.

Li-fan Chen
Tuesday, January 06, 2004

2 billion seconds is 60 years.

How many calls do you get per second?!

Insert half smiley here.
Tuesday, January 06, 2004

Damn. Sorry for duplicating Mr Elephant's point.

Insert half smiley here.
Tuesday, January 06, 2004

With all due respect, if you believe that common practice is to separate data out into separate tables to limit search scope, then whoever passed that "common practice" on to you could best be described as a hack, as that's exactly what it is -- a hack.

It betrays an obvious lack of knowledge regarding indexes, and of how even the largest table is highly efficient given appropriate indexes for the access patterns.

Dennis Forbes
Tuesday, January 06, 2004

As a sidenote, you can fairly easily swap that int for a bigint, allowing for 9,223,372,036,854,775,807 distinct record identity values, if it really is a concern that you'll overflow it.
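
The swap itself is a one-liner, though on a table this size it is a size-of-data operation (every row gets rewritten), so expect it to run for a while and lock the table. Names assumed for illustration:

```sql
-- Widen the identity column from INT to BIGINT. The IDENTITY
-- property is kept. If the column is the primary key or is
-- referenced by indexes or constraints, drop those first and
-- recreate them afterwards.
ALTER TABLE Call ALTER COLUMN call_id BIGINT;
```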

Dennis Forbes
Tuesday, January 06, 2004

I haven't tried it, but if you really want to know what will happen, just create a table with an IDENTITYCOL, seed it with the max integer, then insert a few rows.
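
That experiment looks something like this (names invented); I'd expect the third insert to fail with an arithmetic overflow, since the seed value itself is handed out first:

```sql
-- Seed the identity two short of the INT maximum and watch it run out.
CREATE TABLE OverflowTest (
    id      INT IDENTITY(2147483646, 1) PRIMARY KEY,
    payload CHAR(1) NOT NULL
);
INSERT INTO OverflowTest (payload) VALUES ('a');  -- gets id 2147483646
INSERT INTO OverflowTest (payload) VALUES ('b');  -- gets id 2147483647, the max
INSERT INTO OverflowTest (payload) VALUES ('c');  -- should raise an overflow error
```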

Tom H
Tuesday, January 06, 2004

Dennis, just because indexes help most of the time doesn't mean they help all the time.  Sure, indexes can help you speed through a table to find specific rows, but if you're doing aggregate queries on that much data, it's going to take some time just to load it into memory!  And 4 billion rows is a LOT!

vince
Tuesday, January 06, 2004

"Dennis, just because indexes help most of the time doesn't mean they help all the time"

If you understand indexes then you understand how to make them help all of the time for the access patterns used. Indexes in databases aren't rocket science.

"but if you're doing aggregate queries on that much data, it's going to take some time just to load it into memory"

If you're doing an aggregate on a table without any where predicates, then obviously you intend to aggregate all of the data -- splitting it out into separate tables just means you have to go to separate places to find it. If, on the other hand, you are doing a timespan aggregate, and your clustered index is on the time column, then voila -- an index scan. Clean as clean can be, and ultra efficient.

People usually look to things like splitting data out into span-segregated tables because they don't understand indexing or how indexing is used, which in my book qualifies as a hack. Mind you, there are justifiable cases where you would want to split the data out (for instance, horizontally partitioned views, or historical archiving of old records if data usage were an issue), but that doesn't sound like this case.
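
Concretely, the timespan-aggregate case looks like this (a sketch with invented names, assuming the table isn't already clustered on some other column):

```sql
-- Cluster the table on the call time...
CREATE CLUSTERED INDEX IX_Call_Start ON Call (call_start);

-- ...and a date-range aggregate becomes a tidy range scan of the
-- clustered index, touching only the pages for January.
SELECT COUNT(*), SUM(duration_sec)
FROM   Call
WHERE  call_start >= '20040101' AND call_start < '20040201';
```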

Dennis Forbes
Tuesday, January 06, 2004

well, I'm not a database whiz, but I'm no dummy either....

I've had problems when *join* tables have been millions upon millions of rows, even though they were indexed.  Although I did have where predicates, they were all on the table being joined to, so I believe the entire big table had to be scanned into memory.  If there's a way to solve this, I'd be open to any ideas.
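
One thing worth checking in that situation: an index on the big table's joining column lets the optimizer seek into it row by row instead of scanning the whole thing. A sketch, under the assumption (hypothetical names) that the big Call table joins to a small Customer table on customer_id:

```sql
-- Without this index, a join from the filtered small table tends to
-- force a scan of the big table; with it, each matching customer
-- becomes an index seek into Call.
CREATE INDEX IX_Call_Customer ON Call (customer_id);

SELECT c.customer_id, COUNT(*)
FROM   Customer c
JOIN   Call     k ON k.customer_id = c.customer_id
WHERE  c.region = 'NE'
GROUP BY c.customer_id;
```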

vince
Wednesday, January 07, 2004

If doing aggregates is the main use for this data, you might want to use a flat file.  It can be a lot faster to aggregate data using a file instead of a table.  I believe Oracle has an option where a flat file can be accessed through the database to avoid the overhead of loading the data.  I don't know if SQL Server supports this, but I have seen options to access flat files from ODBC.
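
In Oracle (9i onwards) that facility is called an external table: the file stays on disk and the database parses it on the fly at query time. A rough sketch, with made-up names:

```sql
-- An Oracle external table: queries read straight from calls.csv.
-- Requires a DIRECTORY object (call_dir here) pointing at the
-- folder that holds the file.
CREATE TABLE calls_ext (
    caller  VARCHAR2(20),
    callee  VARCHAR2(20),
    placed  DATE
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY call_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
    )
    LOCATION ('calls.csv')
);
```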

john
Wednesday, January 07, 2004

I wouldn't split the table on the basis of an artificial limit, though if it were being used for billing I'd probably manage it by the billing period.  Not tables for each period, but a table for the current period and a table for historical periods.

But it all hinges on how the data is to be used.  If the data is required to be available for all periods, then I'd keep it all together in the same structure.  As for indexing, I'd probably use the elapsed seconds in the year together with the year number as the main (non-unique) index.
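
That index would look something like the following sketch (column names assumed; they would have to be stored at insert time). Putting the year first is one plausible ordering, since it keeps each year's calls contiguous in the index:

```sql
-- Non-unique index: year number, then seconds elapsed within that year.
CREATE INDEX IX_Call_YearSec ON Call (call_year, elapsed_sec);
```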

Simon Lucy
Wednesday, January 07, 2004
