Fog Creek Software
Discussion Board




database storage

Hi Folks,

I will have a mass (85Gb) of directed (cyclic) graph data: nodes and links etc.

How best to store and query this data in a conventional RDBMS?  I want to ask questions like "for node x, what nodes are neighbours (in all directions)" and, if possible with a normal database, offload things like route finding and layout stuff to the server where the data sits too.

Any ideas?

i like i
Thursday, December 18, 2003

Here is a link to a discussion I found over at asktom,

http://asktom.oracle.com/pls/ask/f?p=4950:8:10713976984635220865::NO::F4950_P8_DISPLAYID,F4950_P8_CRITERIA:12678058160686,

They are using Oracle databases there, and I don't know if it applies to you.

HTH,

Patrik
Thursday, December 18, 2003

What's the reason for wanting to represent this "in a conventional RDBMS"? It seems like a slightly odd choice to me...

The "Ask Tom" discussion is about a different kind of graph. I don't think it's very relevant here.

Gareth McCaughan
Thursday, December 18, 2003

This might be interesting...


Mapping Geometric Data with Oracle Spatial

http://www.oreillynet.com/pub/a/network/2003/11/10/oracle_spatial.html

Patrik
Thursday, December 18, 2003

85 gigabytes, with multiple clients accessing it.  Too much for RAM, so either a) I write my own server, complete with caching mechanism, synchronisation, disk storage etc.; b) I use an off-the-shelf thing like an RDBMS (but how?); c) .. any suggestions?

Using an RDBMS doesn't solve the 'pushing live updates to clients' problem, but oh well.  Ideas there too?

I have trouble working out a spatial representation of the web.  That is half the problem I guess.

i like i
Thursday, December 18, 2003

If all you are looking for is objects and their relations to each other (many-to-many), you can implement that with a simple reference/link table.

I.e.

Node
========
id
data

Noderelation
===============
nodeid-1
nodeid-2

(However I suspect this ain't what you're after).
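For illustration, a minimal runnable sketch of that two-table layout, using SQLite from Python (the table and column names are just adapted from the outline above; the last query pulls neighbours in both directions):

```python
import sqlite3

# In-memory database for the sketch; a real 85GB dataset lives on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id   INTEGER PRIMARY KEY,
    data BLOB
);
CREATE TABLE noderelation (
    nodeid1 INTEGER NOT NULL REFERENCES node(id),
    nodeid2 INTEGER NOT NULL REFERENCES node(id)
);
""")

# A tiny cyclic graph: 1 -> 2 -> 3 -> 1
conn.executemany("INSERT INTO node (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO noderelation VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 1)])

# Neighbours of node 2 in either direction:
rows = conn.execute("""
    SELECT nodeid2 FROM noderelation WHERE nodeid1 = 2
    UNION
    SELECT nodeid1 FROM noderelation WHERE nodeid2 = 2
""").fetchall()
print(sorted(r[0] for r in rows))  # [1, 3]
```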

Juha
Thursday, December 18, 2003

Hi Juha,

that is exactly what I've been looking at.

The direction is encoded by the numbering of the node fields in the relation table too.

But then you have several options for querying the neighbours of node #42:

a) query for all Noderelations where nodeid-1 is #42 or nodeid-2 is #42
b) query for all Noderelations where nodeid-1 is #42 union query for all Noderelations where nodeid-2 is #42
c) add a direction field to each Noderelation.  Enter each relation twice, once in each direction; then query for all Noderelations where nodeid-1 is #42
d) your suggestions welcome please!

So indexing both nodeid-1 and nodeid-2, or just nodeid-2?
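A sketch of option (c), again Python + SQLite for illustration (column names are hypothetical variants of the ones above): each logical edge is stored twice with a direction flag, so a single index on the first column answers the "all directions" question with one lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Option (c): store each edge twice, once per direction, so every
# neighbour query is a single indexed lookup on one column.
conn.execute("""
CREATE TABLE noderelation (
    nodeid1   INTEGER NOT NULL,
    nodeid2   INTEGER NOT NULL,
    direction INTEGER NOT NULL   -- 1 = forward edge, 0 = reverse entry
)
""")
conn.execute("CREATE INDEX idx_rel_n1 ON noderelation (nodeid1)")

def add_edge(conn, a, b):
    # One logical edge a -> b becomes two physical rows.
    conn.execute("INSERT INTO noderelation VALUES (?, ?, 1)", (a, b))
    conn.execute("INSERT INTO noderelation VALUES (?, ?, 0)", (b, a))

add_edge(conn, 42, 7)
add_edge(conn, 9, 42)

# All neighbours of #42, in all directions, from one index:
rows = conn.execute(
    "SELECT nodeid2, direction FROM noderelation WHERE nodeid1 = 42"
).fetchall()
print(sorted(rows))  # [(7, 1), (9, 0)]
```

The trade-off is doubling the table (on top of 85GB) in exchange for halving the index count and simplifying every neighbour query.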

And offloading things like auto-2d layout of nodes to the server instead of a client having to try to swallow the whole dataset?

With database record packing and indices etc, I bet this takes a bit more than 85gig :-(

And finally, live model updates pushed to other observing clients?

i like i
Thursday, December 18, 2003

This might possibly be a good application for an  "object-oriented database": make each node in your graph be an object and represent the edges by object references. (Something along those lines, anyway, depending on what other data you have associated with each node and edge.) It is alleged that OODBs perform much, much better than relational databases when they match your application well, and this seems like a case where they might. I have no experience of using them, so can't make concrete recommendations.

Can you tell us a bit about what you're doing with this monstrous billion-node (or whatever it is) graph?

Gareth McCaughan
Thursday, December 18, 2003

For a database of that size you'd need some expensive kit.

For such a large map of nodes, it may be worth partitioning regions of the map into separate sets of tables. This will make your database less contentious when it comes to read/write locking.

You could even move these large sets onto separate servers, e.g. four, one for each quadrant. You'd need a layer of logic to manage these partitions.

In my opinion I'd give it a try with MySQL; it's free, so you've nothing to lose but your time trying.
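The partition-management layer could start out as something as simple as hashing the node id to pick a server; a toy sketch with made-up server names (the quadrant scheme above would key on spatial position instead):

```python
SERVERS = ["db0", "db1", "db2", "db3"]  # hypothetical connection names

def server_for(node_id, n=len(SERVERS)):
    # Simple modulo partitioning on node id: cheap, stateless,
    # and spreads nodes evenly -- but cross-partition edges still
    # force queries to fan out to more than one server.
    return SERVERS[node_id % n]

print(server_for(42))  # db2
```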

This wouldn't have anything to do with web links would it?

Tim H
Thursday, December 18, 2003

I have successfully done simple global node1id - linktype - node2id tables for directional graphs with typed links on very modest hardware (PIII 1GHz/512MB - SQL Server 2000), but only for a few million nodes, far removed from the volume you indicate.
Depending on your query pattern and graph linkage you might gain by partitioning etc.

Just me (Sir to you)
Thursday, December 18, 2003

I'm not sure if this is relevant, but you might want to investigate "Materialized Path" and/or "Nested Tree". Both let you efficiently represent and query hierarchies/trees in an RDBMS.

Materialized Path ends up looking like a numbered outline (1, 1.1, 1.2, 2, 2.1, 2.1.1, etc.). It is also very efficient for adding/removing elements as long as you don't give in to the temptation to renumber when adding/removing/sorting your nodes. I.e. don't use the materialized path numbers for display or the users will get confused by gaps. Counting the separators (in my example, the periods) tells you how 'deep' you are.

Nested Tree applies two numbers (Left & Right) such that all values between L & R represent descendants of that node, and that L + 1 or R - 1 is a 'direct' child. Nested Tree is less efficient during adds/removes since all L/R numbers higher than the added/removed node must be recalculated. Although it's a simple increment/decrement with trivial criteria, it's still a recalc. I'm not sure how you easily get 'depth' out of a Nested Tree.
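On the 'depth' question: an ancestor in a nested-set table is exactly a node whose (Left, Right) interval encloses yours, so depth falls out of counting enclosing intervals. A toy sketch in plain Python with a made-up four-node tree (root A contains B and C; B contains D):

```python
# (name, lft, rgt) triples for the tree A -> (B -> D, C).
tree = [
    ("A", 1, 8),
    ("B", 2, 5),
    ("D", 3, 4),
    ("C", 6, 7),
]

def descendants(tree, name):
    # Descendants are the nodes whose interval lies inside ours.
    lft, rgt = next((l, r) for n, l, r in tree if n == name)
    return [n for n, l, r in tree if lft < l and r < rgt]

def depth(tree, name):
    # Depth = number of ancestors = number of enclosing intervals.
    lft, rgt = next((l, r) for n, l, r in tree if n == name)
    return sum(1 for n, l, r in tree if l < lft and rgt < r)

print(descendants(tree, "A"))  # ['B', 'D', 'C']
print(depth(tree, "D"))        # 2
```

Both lookups translate directly into single SQL range queries, which is the appeal of the scheme for read-heavy trees.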

Ron Porter
Thursday, December 18, 2003

I don't think there's any adaptation of those ideas that works when the graph has cycles.

Gareth McCaughan
Thursday, December 18, 2003

i like i, how many nodes are there?  I suspect not 85,000,000,000?

Scot Doyle
Thursday, December 18, 2003

Or maybe 85 Gb = 10,625,000,000 GB?  So is it 10 billion nodes or significantly less than that?

Scot Doyle
Thursday, December 18, 2003

blah, 85 Gb = 10,625,000,000 bytes,

not 85 Gb = 10,625,000,000 GB !!!

Scot Doyle
Thursday, December 18, 2003

PostgreSQL has some first-class support for storing and querying geometric elements.

Li-fan Chen
Thursday, December 18, 2003

85GB.

There are about 4 million nodes.  As you can see, each node is pretty light!  Each node is just a 32-bit document identifier which clients resolve from a separate server that is already there.

It is the cycles that kill the querying, I think.

i like i
Friday, December 19, 2003

In a post above, Just Me said that he had a few million nodes, so maybe you could ask him for specifics since you only have 4 million.

Another question that comes to mind is: how many links total do you have?  With only 4 million nodes, would it be possible to create a main-memory structure that would fit on a single server?  The document data could still live elsewhere.
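Back-of-envelope arithmetic on that: assuming the 85 GB is mostly link data and each link is two 32-bit ids, the raw numbers come out as follows (real per-row database overhead would shrink the links-per-node figure considerably):

```python
total_bytes = 85 * 10**9        # "85GB", taking GB = 10^9 bytes
nodes = 4 * 10**6               # ~4 million nodes, per the thread
bytes_per_link = 2 * 4          # two 32-bit node ids per link

per_node = total_bytes / nodes            # bytes of link data per node
links_per_node = per_node / bytes_per_link

print(round(per_node))          # 21250 bytes of link data per node
print(round(links_per_node))    # 2656 raw links per node
```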

Scot Doyle
Friday, December 19, 2003

Mine had only a few links per node, making the table far, far smaller (about 700Mb if memory does not fail me) than the one the questioner is describing.
He seems to have on average a few hundred links per node.

A lot of the strategy depends on the queries and whether the dataset is mostly static or continuously updated, whether there are nice predictable "islands" of heavily connected things with only sparse linkage in between, etc. etc.

Just me (Sir to you)
Monday, December 22, 2003

Oh well, MB for the purists ;-)

Just me (Sir to you)
Monday, December 22, 2003
