Fog Creek Software Discussion Board

Efficiently storing time-series data

Howdy everyone,

I'm currently working on a web application that needs to store price history data for a large range of stocks, indexes, options, etc. I'm wondering about the best way to store this data in a relational DBMS*, but I'm not sure exactly how. I can think of a couple of ways:

1) One big table, or one per symbol type, keyed by symbol and date.

2) One table per date, merged if necessary, using, e.g., a MySQL MERGE "table".

3) One table per symbol, again merged to provide a virtual table.

4) One table per symbol, not merged.

5) One table per date, not merged.

I'm leaning towards one table per symbol, although this could result in thousands and thousands of tables... However, most queries are going to be based around a single symbol rather than a single day - e.g., drawing the graph of a stock price over 10 years would involve querying around 2,000 "date" tables, but only 1 "symbol" table.

Any thoughts? I currently have a few books coming, but also want to hear opinions from "the field" :-)

Cheers,
Christo

* I'm just assuming this will be the best way... possibly there are others that would be more efficient, but I can't imagine what!

Christo
Monday, September 29, 2003

How many rows are you talking about here? It doesn't sound like too much data for a single table. Indexes would take care of query speed.

Troy King
Monday, September 29, 2003

If the time data is as granular as a second, then I'd convert the date and time to seconds elapsed this year (plus a year field). If you have more than one kind of event that can happen, then store that as well.

Simon Lucy
Monday, September 29, 2003

Based on some clues in the original question (2000 "tables" for a 10 year query), I am assuming that the data is being captured daily.  Is this correct?

If so, there's absolutely no reason why you shouldn't store all of this in one table.  How large is your "large range of stocks, indexes, etc.?"  Tens of thousands?  Hundreds of thousands?

100,000 rows per day times 200 trading days per year times 10 years is still only 200,000,000 rows; depending on the required response time and the hardware budget, this shouldn't be a problem.
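
For what it's worth, a single-table layout for daily prices could look roughly like this (table and column names are just placeholders, and the exact types depend on your RDBMS):

create table price_history (
    symbol       varchar(12)    not null,
    price_date   date           not null,
    close_price  decimal(12,4)  not null,
    volume       bigint,
    primary key (symbol, price_date)
);

A query for "10 years of prices for one symbol" is then a single range scan of the primary key, and an extra index on price_date covers the "everything for one day" queries.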

Michael Dorfman
Monday, September 29, 2003

You *might* want to give serious consideration to storing this data in some form other than a relational database.

Imagine, e.g., storing each entity's price series in a separate file, just as a big pile of 32-bit integers (in, say, thousandths of a cent), one per sample, with some magic value meaning "no entry here". Then your queries become scarily fast and your data is stored about as compactly as it could be. (If you didn't care about access speed, you could probably bring the space down by a factor of 4 or so by storing deltas and compressing them in some simple way.)

Every time you add a new sample you'll need to lock that entity's file while you do so. None of this is difficult, and it could give you a substantial win over a relational database; a factor of 3 in both space and time wouldn't surprise me. (A little wrinkle: if your operating system doesn't do a good job of keeping files unfragmented, you might want to make the update process take a fresh copy of the file it's updating every now and then.)

BUT:

If your sampling dates are different for every stock, or if you need to store lots of information tied to dates other than the prices, or if you want to avoid the danger of getting the locking wrong and trashing your data, or if you sometimes want to do queries on a fixed date across all your stocks, or if you want to combine this stuff seamlessly with data that for some other reason has to be in an RDBMS, or if any number of other things I'll leave for you to think about are true, then the costs are likely to outweigh the benefits.

Gareth McCaughan
Monday, September 29, 2003

Gareth, how does that solution affect *analyzing* the data? (i.e. cubes, progression analysis, etc)

Start with "which stock had the greatest increase over this six-week date range?"

Philo
Monday, September 29, 2003

Gareth -- you think he can write something that can query faster than a DBMS that's had people working on its speed for years? Not likely -- there are very few people on the planet capable of even coming close to that.

Troy King
Monday, September 29, 2003

Troy, consider the following scheme (used, e.g., by kdb):

Store each numeric column as a raw vector (4 bytes per integer, 8 bytes per double, etc.) Store each string column as an index into a table - again, 4 bytes per row. The table will have to be memory resident, but this is easy to do for stocks as there aren't too many symbols (20 bytes per symbol * 100,000 symbols is ~ 2megs of memory).

Now, map each column into memory from local disk; this takes more or less zero time. All "full table scans" will be as efficient as they can be made - the OS will cache and read ahead everything for you. Computing, e.g., the volume-weighted average price reduces to this loop:

double volume = 0.0, notional = 0.0;
for (i = startindex; i < endindex; ++i) {
    volume   += mapped_volume_column[i];
    notional += mapped_volume_column[i] * mapped_price_column[i];
}
return notional / volume;   /* volume-weighted average price */

This loop, on a decent compiler, runs in a very small number of cycles per iteration (e.g., 10). On a decent OS, pages will be fetched in advance (this is a perfectly predictable linear scan of file and memory). If you use a smart compiler, it will emit prefetch instructions to make this loop execute even better.

I'm not aware of any RDBMS, with the possible exception of kdb, that can compete with tight code like this, assuming the queries are simple enough to hand-code and can't make effective use of indices (which is often the case with time-series data).

Furthermore, for a small requirement set, it's probably much simpler to do it this way than with a database.

The _flexibility_ of a standard RDBMS is hard to compete with - you have a general-purpose query language, management tools, etc. But the _speed_ of a general RDBMS is not impressive for most tasks: it is often orders of magnitude slower than what a well-designed but not particularly optimized custom implementation can achieve.

Ori Berger
Monday, September 29, 2003

A star schema is, I think, pretty common for tracking stock prices in an RDBMS; beyond that I don't know.

Li-fan Chen
Monday, September 29, 2003

Ori, where did you learn all that?? That's amazing :-)
(not being sarcastic) I think KDB has a discussion thread, but I don't remember reading anything that detailed before...

Li-fan Chen
Monday, September 29, 2003

Ori's technique is nothing new to your average APL, A, J, or K programmer, who has been doing this stuff for more than 20 years.

Paul Mansour
Monday, September 29, 2003

I have a similar application to maintain. We provide a wide variety of queries for technical trend analysis using all kinds of weird logic (think: which stocks should I buy or sell today?).

The star pattern is what you need. Put everything in one table, index it separately on date and stocks, and you should be fine. "Very old" data can be archived out of this table into another table when the number of rows becomes unduly large (we archive data older than 12 months, as we record prices every second).

We use SQL Server and it is very fast, giving under-two-second responses even for complex queries (of course, they are all optimized to death).
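
Roughly, the indexing and archiving amount to something like the following (table and column names here are only illustrative, not our actual schema):

create index ix_price_symbol on price_history (symbol, price_date);
create index ix_price_date   on price_history (price_date);

-- move rows older than some cutoff (say, 12 months ago) into an
-- archive table of the same shape, then remove them from the hot table
insert into price_history_archive
    select * from price_history where price_date < @cutoff;
delete from price_history where price_date < @cutoff;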

Tarun Upadhyay
Monday, September 29, 2003

Thanks heaps, everyone, for all your ideas... it looks like an RDBMS is going to be the way to go. I was a bit worried about problems occurring over time, but if a table can hold a couple of hundred million rows* without too much trouble, then it should be able to hold the 40,000,000 or so rows a year I'll generate! (Up till now, the biggest table I've used in a production system was around 700,000 rows.) I'll go with a flat-file/in-memory approach for the most recent data (which is what the vast majority of queries will need to access) if the RDBMS gets too slow.

Cheers!
Christo

* Ignoring the differences between RDBMSs, the need for table tuning, etc.

Christo
Monday, September 29, 2003

Apparently you're dealing with 2 entities:
- item (strong entity - i.e., it can exist independently of other entities)
- rate (weak entity - i.e., it cannot exist without a parent tuple)

The relationship between the two is: "An item has 0 or many rates".

The item should use an artificial key for identity.

Rate, instead, should use as its key the combination (item.id, rate.date_created). It should also contain a nullable field date_expiry, which should be null only for the latest valid rate. Therefore, querying for the current rate would be:

select * from rate where rate.item_id = ? and rate.date_expiry is null;

Querying for the entire rate history of one item would be:

select * from rate where rate.item_id = ? and rate.date_expiry is not null order by rate.date_expiry;
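
In DDL terms, the two tables might look roughly like this (the column names are just my guess at the structure described above):

create table item (
    id    integer     primary key,   -- artificial key
    name  varchar(32) not null
);

create table rate (
    item_id       integer       not null references item(id),
    date_created  date          not null,
    date_expiry   date,                 -- null only for the latest valid rate
    value         decimal(12,4) not null,
    primary key (item_id, date_created)
);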

Anyway, something like that; for more, search on "slowly changing dimensions". Secondly, try to read a bit about the relational model, because what you're trying to do (one table per symbol) goes against what the relational model is (i.e., set theory extended to relational algebra).

Dino
Tuesday, September 30, 2003

Hi Dino,

Thanks for your pointers - I was planning on storing the data as time points rather than over a time range, but very interesting stuff nonetheless!

And yes, I'm well aware that a "table per symbol" is a rather un-relational model, but it was a possibility I was entertaining if other ways were too inefficient, because I could still programmatically create the queries - e.g.
"... WHERE symbol = 'XYZ'"
=>
"SELECT ... FROM symXYZ ..."

Hope that explains my reasoning a bit more!
Cheers,
Christo

Christo
Tuesday, September 30, 2003

I have no "insider" knowledge of kdb; I just read everything Arthur Whitney writes, and try to learn from it. While k/kdb are not open source, Arthur doesn't keep anything secret.

As Paul mentioned, this is standard practice for an APL / J / K programmer, and even Fortran programmers use these techniques all the time (memory mapping wasn't included the last time I used Fortran, but perhaps things have changed).

Between APL dialects (I prefer K) and Lisp, all the amazing stuff is either already there or is a small incremental step that doesn't require a new language (OO, AOP, metaprogramming, and vector processing are all _user level_ constructs in Lisp, for example, whereas they require language extensions in C, C++, Java, and other languages).

Most of the recent buzzword-compliant stuff is there for its own sake; you _can_ do better without it.

For such (often old) amazing stuff and more, learn from the guys that did it well 20 to 40 years ago: Whitney, Iverson, Steele, Moon, Apter, van Rossum, the guys who designed Pick (don't know who they are), etc.

(Ok, so van Rossum is younger, but he does deserve to be regarded as an "elder" imho).

Ori Berger
Tuesday, September 30, 2003

Time-series data is one of those few areas where an ODBMS might be considered.

http://www.kaistgsm.ac.kr/re_center/paper/PP-1996-003.HTM

http://www.objectivity.com/DevCentral/Products/TechDocs/Whitepapers/WallFiles/WallPprSummary.html

fool for python
Tuesday, September 30, 2003

I would suggest the following:

Table 1 (stock, index, etc.)
stock_id
current_value
other_stock_info

Table 2 (history)
history_id
stock_id
history_value
begin_date
end_date

Example:

Table 1
1, $2.00, ATC

Table 2
1, 1, $1.29, 01/01/2003, 01/03/2003 (prior and original value)
2, 1, $2.00, 01/03/2003 (current value)

I have used this paradigm to track ownership history and change-of-status history with good results. Just note that you may be tempted in table 2 to have only one date; while that does simplify data updates (you don't have to go back and update the prior row's end date), I find that having the two dates makes queries much simpler.
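
For example, "what was stock 1 worth on 01/02/2003?" becomes a single range test against table 2 (a sketch only, using the names above; the literal date is just for illustration):

select history_value
from   history
where  stock_id = 1
  and  begin_date <= '01/02/2003'
  and  (end_date is null or end_date > '01/02/2003');

With only a begin date, you would need a self-join or a subquery to find "the latest row whose begin date is not after the target date."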

Justin K.
Tuesday, September 30, 2003

Philo: Um, did you read my second paragraph? :-)

Troy: Yes, I think he can write something that will be faster *if he doesn't need all the other stuff an RDBMS does infinitely better*. The guys at Oracle and Microsoft and wherever PostgreSQL is being developed are very good at what they do, but the problem they're solving is: "How can we make a SQL database, with all the flexibility it requires, store data really compactly and operate on it really quickly?" That's a constrained problem. When the constraints don't apply, you can sometimes get a much "better" solution.

The underlying point I was making is the same one Philo was making: Christo needs to think hard about what he's going to want to do with the data before he can make a meaningful decision about how to store it.

Gareth McCaughan
Tuesday, September 30, 2003

http://www.cs.nyu.edu/cs/faculty/shasha/papers/aqueryseminar.ppt

It gives good examples of why a standard RDBMS is not a good solution for time-series data, and of what a good solution would look like.

Ori Berger
Tuesday, September 30, 2003

For star schemas check out:

Use your browser's find feature to search for "Dimensional Data Modeling" on http://rhea.redhat.com/asj/data-warehousing.html

Anonymous
Wednesday, October 1, 2003
