Fog Creek Software
Discussion Board




Storing long-term histograms

I have a project where a computer is collecting the cycle count of an injection molding machine via a data acquisition card: the molding machine has a sensor wired to the DAQ that pulses each time the mold closes.  This will be repeated for 70 machines here, and more at our other factories.

The data collected will be a series of timestamped pulses, and I'm having trouble deciding how to store it.  The data will be held for a long time in order to generate reports on machine performance.  I was thinking of a database table with two fields: machine # and a timestamp, with one record for each pulse.  It has the virtue of simplicity, but it seems a little heavy at 12 bytes a pulse.  On the other hand, it makes the data easily munge-able with SQL.

Is there some normal way of doing this that I'm missing?  How are analog histograms normally stored?

Justin Johnson
Friday, July 30, 2004

"with one record for each" should read "with one record for each pulse".

Justin Johnson
Friday, July 30, 2004

Not sure what the generally-accepted way of doing this is, but here's what I've done.

Create a table of histogram buckets, with the following columns:  bucket_id, min_value, max_value, bucket_name.  Create a separate table with these columns:  bucket_id, count [, other data you might need to filter by, like machine_id, date, etc.].

Straightforward to implement, although you may find it annoying that you can't really tell whether to update or insert when you get a new data point.

One problem here is that you can't reliably change the bucket bounds later on.  One approach is to define a large number of very finely-spaced buckets, but give several buckets the same name and group by the bucket name when pulling data out.  So you could have separate buckets for 0-1 second, 1-2 seconds, 2-3, etc., but have them all named "0-5 seconds".  When you query, just group by bucket_name and sum the counts.  If you need more "big buckets" later on, but they're still oriented on "little bucket" boundaries, you can do it by changing the association of little buckets to big buckets.
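
A minimal sketch of what that could look like in SQL (table and column names are mine, not a prescription; "count" is renamed pulse_count to stay clear of the reserved function name):

CREATE TABLE bucket (
    bucket_id   int         NOT NULL PRIMARY KEY,
    min_value   float       NOT NULL,  -- inclusive lower bound, seconds
    max_value   float       NOT NULL,  -- exclusive upper bound, seconds
    bucket_name varchar(50) NOT NULL   -- coarse "big bucket" label, e.g. '0-5 seconds'
)

CREATE TABLE bucket_count (
    bucket_id   int      NOT NULL REFERENCES bucket(bucket_id),
    machine_id  smallint NOT NULL,
    count_date  datetime NOT NULL,
    pulse_count int      NOT NULL,
    PRIMARY KEY (bucket_id, machine_id, count_date)
)

-- Reporting: roll the little buckets up into their named big buckets
SELECT b.bucket_name, SUM(c.pulse_count) AS total_pulses
FROM bucket b
JOIN bucket_count c ON c.bucket_id = b.bucket_id
GROUP BY b.bucket_name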

schmoe
Friday, July 30, 2004

This is the kind of thing that's traditionally been handled with a flat file.  A relational database is overkill, even more so if you don't need transactions and rollback.

Is your storage space unduly limited in some way?  If not, store your data as tab-delimited ascii and use unix shell tools or perl to munge it.  grep, cut, awk!

Brent
Friday, July 30, 2004

Not that I have any experience in data collection, but my personal feeling is that there is no reason not to store data in a structured form anymore.  Flat files aren't even less complicated than SQL databases, as everything supports SQL these days.

In fact, the current generation of programmers seems more comfortable working with "SELECT FROM" than with open(); fseek().

christopher (baus.net)
Friday, July 30, 2004

If space is really at a premium, you could choose a start timestamp and store the difference between each machine press and that start time. Four bytes gets you about 136 years at one-second resolution (or about 16 months at hundredths of a second), plus the machine id.
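
A sketch of that idea in SQL, assuming whole-second resolution and an arbitrary epoch of 2004-01-01 (names are made up):

CREATE TABLE pulse_compact (
    machine_id     smallint NOT NULL,
    seconds_offset int      NOT NULL   -- seconds since the chosen epoch
)

-- Reconstruct the real timestamp when reporting
SELECT machine_id,
       DATEADD(second, seconds_offset, '2004-01-01') AS pulse_time
FROM pulse_compact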

Rich
Friday, July 30, 2004

Simple. 

If the database supports a word type, use one word field to store seconds since a specific constant time (should give you around 136 years) and use another word to keep the machine id.

If your DB does not support a word type, just use a 4-byte integer with the high word being the machine id and the low word being the number of seconds since a constant time.

Code Monkey
Friday, July 30, 2004

If you mean a word as 16 bits (as it traditionally is), that's only 65,535 seconds, which is only about 18 hours (60*60*18).

In any case there's a crucial requirement missing, which is how frequently the mold presses, and what the required timing accuracy must be.

Dennis Forbes
Friday, July 30, 2004

The problem is that you want to index the machine #, and possibly the time stamp. That means you've got a HUGE overhead on an itty-bitty insert, which affects your scalability.

So the other option is to write to a delimited text file, then BCP the data into SQL every so often (if the machines are offline at night - voila!)

This distributes your processing power, minimizing your hardware requirements.
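
A sketch of that nightly load using T-SQL's BULK INSERT (bcp from the command line works the same way; the file path and table name here are made up):

-- Load the day's tab-delimited pulse log in one batch
BULK INSERT dbo.MachinePulse
FROM 'D:\daq\pulses_20040730.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    TABLOCK   -- one bulk lock instead of 70 machines' worth of row locks
)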

Philo

Philo
Friday, July 30, 2004

This is against every database design norm (and I cringe even writing it), but if you really want to save space, consider a separate table for each machine, eliminating the need for the machine # column (it can be created artificially in the view). This will have the additional benefit of automatically partitioning the data so that there is no contention as separate machines pulse. On the flip side, with a platform like SQL Server 2000 or above, it lends itself to horizontal partitioning (I don't know what platform you're using, but just throwing it out).

For querying, you'd go through a partitioned view, which transparently queries the separate tables (on different servers if desired, sharing the load).

CREATE VIEW MachinePerformance AS
  SELECT 1 AS machine_id, pulse_time FROM Server1.MachineData.dbo.MachinePerformance1
UNION ALL
  SELECT 2,pulse_time FROM Server1.MachineData.dbo.MachinePerformance2
...
UNION ALL
  SELECT 98,pulse_time FROM Server4.MachineData.dbo.MachinePerformance98

Dennis Forbes
Friday, July 30, 2004

Hrmmm...skip that idea. With an artificial key the view isn't updatable. Damn those artificial keys!

In any case it's still a potential performance improvement with actual primary keys (hence no space savings) and with table check constraints.
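
Under those rules, each member table would look something like this (a sketch following the naming in the view above; the CHECK constraint on machine_id is what lets the partitioned view route rows to the right table):

CREATE TABLE MachinePerformance1 (
    machine_id smallint NOT NULL CHECK (machine_id = 1),
    pulse_time datetime NOT NULL,
    CONSTRAINT PK_MachinePerformance1 PRIMARY KEY (machine_id, pulse_time)
)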

Dennis Forbes
Friday, July 30, 2004

Very interesting answers.

To clarify our particular situation: we're using MS SQL Server 2000 on Windows 2000.  The machines cycle once every 20-180 seconds, and we have 70 machines.  The pulse is on-off (or rather, off with a pulse occurring at a single point in each cycle).  Storage space is plentiful, and my first guess (machine #, timestamp) won't unduly stress either the hard drives or the processors.

It's just that something bothers me about using a relational database, with all that overhead, to store data this simple.  On the other hand, we've already paid for the overhead, and nothing about this system will cause usage peaks--all the work will be evenly distributed around the clock.

I was hoping that there was some more *elegant* way to do it.

Justin Johnson
Friday, July 30, 2004

Ultimately the most elegant solution is a simple machine #, datetime -- virtually anything else (such as custom time storage in alternate data types) will be classified as a "hack" by all who follow. Given the rather light insert demands over an average period, I wouldn't bother with any custom solutions on the database end.
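
For reference, that minimal table is just (a sketch, my names):

CREATE TABLE MachinePulse (
    machine_id smallint NOT NULL,
    pulse_time datetime NOT NULL,
    CONSTRAINT PK_MachinePulse PRIMARY KEY (machine_id, pulse_time)
)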

Getting the data from your data acquisition software to the database, on the other hand, could use some customization. Given the highly variable amount of data coming in, consider serializing the data writes to the database, for example through a single pooled, queued object. That way, if all 70 machines triggered at once, there would be no backlog or contention on the database: the queue builds up momentarily (presumably you'll have timestamped the data on acquisition rather than on the insert into the database) and the queued component can insert each record individually, allowing plenty of time for the database to maintain indexes, whatever.

Dennis Forbes
Friday, July 30, 2004

I don't think you know what a histogram is.

Better check before using the term: http://en.wikipedia.org/wiki/Histogram

If you really need a histogram, a tiny file with some counters (one for each bin) is enough.

Nix
Friday, July 30, 2004

I did a logger for a mobile pavement testing vehicle some time ago that put timestamped observations onto a removable hard drive for nightly upload to Foxbase.

Later did one end of a logger that had to get it right over homebrew TCP/IP from a state-of-the-ark lottery system.

Both systems stored data as it happened, timestamped, together with significant events such as machine startup/shutdown, operator comments ... into sequentially named ASCII flat files on disk as fast as possible.

The second system used RAID 1 mirrored hard disks and another processing thread copied the data onwards to a remote machine via a VPN. Justified paranoia.

Never throw anything away.
Timestamp everything that happens.
Equipment, especially storage, is cheap.
Raw data is irreplaceable.

Use one file sequence per machine logged.
Have all timestamps come from the one clock. 

IMHO, going directly from DAQ to SQL is way too perilous.

trollop
Friday, July 30, 2004

elegant?

I dunno, I think the unix flat-file approach is elegant in its own perverse way.

You could always use a WWV radio and a two-track tape recorder...

Brent
Saturday, July 31, 2004

"The machines cycle once overy 20-180 seconds"

Why the wide range? Different kinds of work?

[not related to the solution - just curious]

Philo

Philo
Saturday, July 31, 2004

The business does injection molding, where a mold is filled with liquid plastic under pressure, then cooled and solidified in the mold.  Smaller molds mean less plastic, which means quicker cooling, and the size of the molds we use goes from little ones a foot on a side to large ones six feet tall.  The cycle time is largely determined by how quickly you can get a solid piece of plastic out of the mold that won't warp.

Justin Johnson
Saturday, July 31, 2004

I second trollop's comments: raw data is invaluable. In my experience, you give people one view and they say 'wow, what about changes over time?' (or something). If you've got all the raw data, you can do magic things: maybe one machine cycles 0.3 seconds slower than the others, but produces higher quality output. You'd never see that in a histogram, or in highly compressed data.

And keep it simple. No storage approach that saves about a gig of disk space per year can possibly be cost-effective compared with the time to implement and debug it.

bah_humbug
Saturday, July 31, 2004

For raw data capture, I think a text file is simpler than a database.  Sure, a single 'INSERT INTO' command takes up the same amount of CODE as an 'fprintf("...\n");'.  But the amount of processing is much less with the text file.

Eventually, you'll want to put some extracted/summarized version of the data into the database for analysis.  But it's nice to have the raw version around for later analysis.

AllanL5
Sunday, August 01, 2004

Someone told me of a system which used inserts into a database for high-speed logging, because the insert was LESS expensive from a resource-consumption point of view. But it may have been a custom DB, handling data from multiple sources.

In your case, an insert is clearly the 'right' thing, but because you don't need the data in real time, something like a file per machine, parsed once per day, would let you distribute the insert load at the risk of losing data if the file/parse process dies at the wrong point in time.

mb
Monday, August 02, 2004
