Fog Creek Software
Discussion Board




How many times can you half a number?

I have nodes with two numbers, left and right.  A node can be said to be a child of another node if it's two numbers fall between the parent nodes left and right.

So A(1,4) is a parent of B(2,3)

But what if you want to add another node C, also a child of A?  You have to shift A's right number over a bit to make room.

A(1,6) ->
    B(2,3)
    C(4,5)

It would be better if you could just split the difference between the rightmost child and the rightmost parent, so that C(3.3, 3.6), and you didn't have to potentialy update a whole chain of siblings.

But how many times can you keep doing that?  I coded up a simple loop and the answer popped out as... about 32.  I upped the variable sizes to doubles and the answer was... about 32.  I tried using numeric (Using SQL here) with 1000 digits of precision and still, ~32.

What gives?  Is this obviously the answer?  Is 32 the new 42?  I was really hoping for something more reasonable like 1,000,000...

Numerically Challenged
Saturday, April 24, 2004

Yeah 32 sounds about right :)

May I suggest the BASIC line number algorithm instead: assign numbers 100, 200, 300. To put something between 100 and 200, put it at 150. Between 100 and 150, put it at 125.

The first time you run out of room, go through and renumber everything from the beginning to the end, leaving 100 number gaps between for each one.

Another solution is to use strings of arbitrary lengths (text/memos) and sort alphabetically. So you get A, B, C. Between A and B you put AM. Between A and AM you could put AF. Between AE and AF put AEM.

For more efficiency use varchars limited to, say, 8 letters, and "renumber" when you run out of room.

The code is not going to be pretty. Relational databases do a pisspoor job of storing trees (I deal with this problem every day with CityDesk; currently we have the lame algorithm of renumbering every sibling whenever you insert.) Oracle has some proprietary built-in features that make it a tiny bit easier.

Joel Spolsky
Fog Creek Software
Saturday, April 24, 2004

You are using a way too complicated design to store parent-child node relationships. Just give each node a unique ID and a parentID. To create a new child for a node you just set the node's parentID to the ID of the parent node.

Then to select all children of a node with for example ID 12 you would have something like:

SELECT
  NodeName
FROM
  Nodes
WHERE
  ParentID = 12;
 
If you need manual sorting you still need a SortID field for which you might need to renumber every time a custom sort is performed. However, the number of nodes involved in the renumbering the childs of a single node will be orders of magnitudes less than looping through all nodes.

About the number of times you can half a number: it depends on the precision on the floating point type you are using and the amount of bits used to store it. Typically 32 bits float have ~8 significant digits. In your case you found this out experimentally. 1/(2^(32)) ~= 2e-10.

64 bits floats go to ~15 digits. I would not create code however that depends on this as the number of significant bits can be different on different platforms and compilers. Some compilers have a type where you can set the number of significant digits, but these types generally have piss poor performance and should be avoided if possible.

Jan Derk
Sunday, April 25, 2004

One more remark about the SortID. The way I have described is, is how I implemented it in a certain application.

However, thinking about it a bit more, it struck me that using a SortID field is kind of stupid too.

A much cleaner and efficient design can be created by adding PreviousSiblingID and NextSiblingID fields to the node's data structure. That would totally get rid of the ugly renumbering stuff and work in trees with millions of nodes.

Jan Derk
Sunday, April 25, 2004

I like the PreviousSiblingID and NextSiblingID, because then you can treat the children of a given node as a doubly-linked list.  It seems like that would make handling that set of nodes a lot simpler, since every CS student had to do linked lists at some point.

However, I can't come up with a way to get a recordset containing all the children of a certain node, sorted in the order of the linked list.  It seems like I'd have to get the first child node, and then keep querying for the next item in the list until I have them all.  Does someone with better SQL skills know a way around that?

Emperor Norton
Sunday, April 25, 2004

One would first load all children data with an unsorted SQL query and create the node objects. Then one can traverse through the created child nodes in a sorted way by using GotoFirstChild, GotoNextSibling methods of the node class.

I just checked the data structure of Mike Lischke's excellent Virtual Treeview Delphi component
http://www.soft-gems.net/VirtualTreeview/VT.php

and it works exactly like this. Each node has pointers to its parent, previous sibling, next sibling, first child and last child. My thought was that one would obtain the first child by finding the child with an empty PreviousSiblingID, but a pointer to the first child would surely be much more efficient.

Jan Derk
Monday, April 26, 2004

We would usually give a unique character ID for each node, it will be divided into "levels", so first 2 (or it depends whatever maximum possible number of node you want) characters for root level, next 2 for children etc.

AB
  -ABAA
        -ABAAAA
                    -ABAAAAAA
        -ABAAAB
  -ABAB
        -ABABAA
        -ABABAB 
        -ABABAC

We index the field and although addition of new nodes may take a bit longer search is quite fast. Besides lets say you want to find all subnodes of ABAA branch - you make a simple select for all records starting with ABAA - something you can't do easily with the proposed "parent ID" method.

Works well and fast when you don't update the tree extra frequently and will do for most of applications.

Before we started using that we had created tables using different methods for keeping child-parent relation, populated with hundreds of thousands records and run different speed/space measuring tests. Took about 2 hours.

Vlad Gudim
Monday, April 26, 2004

There are 3 different approaches I've seen:
1) ParentID: have a foreign key referring to the PK of the parent. a NULL parentID means that you're at the top of the tree
2) pathname: have a long string containing the path of the element e.g. "grandparentID.parentID.childID"
3) use Joe Celko's approach with LeftID and RightID that the original poster is using.

1) is fast, robust and convenient for most tasks but makes it difficult to get a list of all the descendants or ancestors (as opposed to just the parent or children) of a given record without sitting in loops and building temporary tables.

2)  require triggers to maintain referential integrity, but allows for efficient insertions, and lists of all descendants are easy to find with a "like 'grandparentID\%'" clause (lists of all ancestors can easily be done if you store a reversed pathname as well. As well as the RI problem, this approach also has an upper limit on the depth of the tree, depending on how long you make the pathname field.

3) allows for more efficient ancestor/decendent searches than 2) (straight numeric comparion, rather than using strings), arbitrary depth of trees, but does make insertions/deletions a bit slow.

All of them have compromises in different areas: you just have to choose the one that hurts you least

Mark Charsley
Monday, April 26, 2004

Thanks Mark for stepping in and providing the clear insight. I didn't even know there was such a thing like Celko's approach.

Jan Derk
Monday, April 26, 2004

Here is a thorough explanation of what Vlad was talking about.  I think its the most flexible and best performing of any solution I've seen but I haven't been able to try it in a production environment yet.

http://www.yafla.com/papers/sqlhierarchies/sqlhierarchies.htm

CharlesC
Monday, April 26, 2004

Regarding hierarchy trees, this is the approach I have used in the past.

Pros:
- Fast
- Easy to get ancestors
- Easy to get children and leafs

Cons:
- Preprocessing required
- Need to renumber on insert or delete, you might be able to do arrange codes (see below) to minimize this

Steps:
1) Assign a unique ID to each of your nodes. Order is not important as long as each node is unique.
2) Build your hierarchy, In my case I use a linked list in Delphi. You could probably be clever and do this in SQL.

Example:

A
  -B
        -D
                    -I
        -E
  -C
        -F
        -G
        -H

3) Do a left hand traversal, starting with the root and incrementally assigning a number. Lets call this number the FC (Function Code)
D = Depth

D1  D2    D3    D4

A 1
    -B 8
          -D 10
                -I 11
          -E 9
    -C 2
          -F 7
          -G 4
                -J 6
                -K 5
          -H 3

For each node, store the maximum FC of its children. Lets call this the FC Range.

A 1,11
    -B 8,11
          -D 10,11
                -I 11,11
          -E 9,9
    -C 2,7
          -F 7,7
          -G 4,6
                -J 6,6
                -K 5,5
          -H 3,3

Now for each node we have the following properties:
- UniqueCode
- FC
- FC Range
- Depth

Other useful properties that you could store:
- Parent Unique Code
- Parent Function Code
- Child Count
- IsLeaf

4) Store all these properties in a table "MyTreeInfo"

So what is the point of all of this?

i) Well lets say you want all the nodes for all the children of Node C?

select *
from MyTreeInfo
where FC between 2 and 7 -- Function Code and Range of Node C

ii) Who are the direct Children of Node B?

select *
from MyTreeInfo
where Parent Function Code = 8 -- FC of Node B

iii) Who is the parent's parent of Node I?

Node I is at depth 4, therefore we are looking for a node at depth 2.

select *
from MyTreeInfo
where Depth = 2
and FC <= 11 AND FC Range >= 11

You will find that only one node will fit this criteria and the answer.

At any rate, there are many other possible queries that can be answered using this approach.

I hope people made sense of my ramblings, if not send me some email and I will help you out.

Cheers!


Odin

Odin Ortiz
Monday, April 26, 2004

Hey that's funny - I post in these parts, and suddenly saw a bunch of hits against my page coming from these discussions.

Dennis Forbes
Monday, April 26, 2004

Slightly offtopic:

A cool side effect of using the node/relationship model to store a tree is that it matches perfectly with the academic work that's been done on directed/undirected graphs. If you have to display any of these trees, there are very robust layout algorithms available. Bell labs even produces a free program called "dot" which does this sort of thing (albeit in a Unix-y command line sort of way).

Rob VH
Tuesday, April 27, 2004

*  Recent Topics

*  Fog Creek Home