Fog Creek Software
g
Discussion Board




modeling inheritance in a db

OK, so I've been wrangling with a database design question for a little while. I've talked to some people who I consider good at this sort of thing, and they're stumped as well. Given the ongoing interest in the object-db impedence mismatch, I thought I'd post to this forum and see if anyone here has any ideas.

I'm trying to design a data model that can handle inheritance (the particular use case doesn't matter: I need to keep track of an ordered list of widgets: the widgets can be either Foos (1 text field needed) Bars (10 ints and 10 text fields needed), etc (about 10 types total).

One important goal: I want to design the tables so that foreign key constrants keep the database from being corrupted (the team of programmers that will be building on top of my design are _really_ green, and any headaches at run-time will be my headaches).

Designs I've considered:
0)One table for each widget type (e.g. "Foo" Table) and a central "WidgetTable" that has a "Type" column (that tells you what table to look in for the actual data) and a WidgetId column (that tells you where to look in the table indicated by "Type" for the particular question).

This design doesn't meet my needs (there are no foreign key constraints keeping the tables from getting corrupted: ie, there is no way to guarantee that if the Widget table has an entry, that the data in the Foo or Bar table actually exists).

1)One table for each widget type. (e.g. "Foo" Table) A "WidgetTable" that has an identity column, and then an int column for each type of widget. That int column "Eg: FooID" has a foreign key constraint to the table for that particular question type (e.g. "MultipleChoice Question" table).

Better, but there's no way in SQL to guarantee that multiple columns in the "Widget Table" won't be filled in. So the "FooID" and "BarID" could both be filled in by a careless developer after the fact.

3)One mega-table called "Widgets" with a "type" column (indicating multiple choice, etc) and all the information for every type in it (so, columns for the foo data, the bar data, etc, etc).

Again, this doesn't enforce constraints. How do I guarantee that, for example, a row with "type" set to "foo" has data in the foo columns, and doesn't have it in the "bar" columns? Plus, it's really ugly and unscalable. It smells like bad engineering to me.

Now, I recognize that this problem could be solved at the programming level (i.e. in the stored procedures that access the database). But my goal is to design the database so that programmers can't mess it up. If I go with a stored procedure, I have no way of knowing that a programmer somewhere isn't using jdbc to muck with my data.

I found an article on the ibm site that talks about modeling inheritence relationships in databases. It's a good read, but it didn't help (he basically came up with the same solutions I did, and doesn't care as much as I do about making the design constrained so that the data can't be wrong.
http://www-106.ibm.com/developerworks/webservices/library/ws-mapping-to-rdb/

Another article dealing with the issue
http://c2.com/cgi/wiki?MappingInheritanceHierarchiesToRelationalSchemataInvolvesCompromises

So, it looks like this is offically a "hard problem". Any suggestions?

Oh, if it matters, the rdbms I'm using is sqlserver 2000.

mystified
Monday, April 5, 2004

"1)One table for each widget type. (e.g. "Foo" Table) A "WidgetTable" that has an identity column, and then an int column for each type of widget."

I'd go with this route.

"Better, but there's no way in SQL to guarantee that multiple columns in the "Widget Table" won't be filled in."

You could possibly use triggers to prevent this from occuring.

Almost Anonymous
Monday, April 5, 2004

How about 4 tables:
Widgets containing a WidgetId and a WidgetType
WidgetTypes containing a TypeId and a description of types
WidgetDataTypes containing a widgettype and a datatype
Widget Data containing a WidgetId, a WidgetDataTypes and a ntext and float column.  All data goes into the ntext column and numeric data gets copied into the float column for when you need to do math on the data (or you could just recast the data in the ntext field for math). 

in general you need to look for a meta data driven schema where your data can describe itself.  That should get you to a place where you can deal with inheritance in your db. 

K
Monday, April 5, 2004

What do you need the 'base' table for?  Something is either a foo or a bar, right?  The existance of a row in bar means it's a bar, existance of a row in foo means it's a foo. 

You can always do something like this:
CREATE TABLE foo( fooGUID GUID, ... );
CREATE TABLE bar( barGUID GUID, ... );

CREATE VIEW widgets( GUID, TYPE ) AS
SELECT fooGUID, 'Foo'
  FROM foo
UNION
SELECT barGUID, 'Bar'
  FROM bar

So if you have something that says:
I have a GUID, what type is it?
SELECT TYPE
  FROM widgets
WHERE GUID = 'ABC123'

MR
Monday, April 5, 2004

The XML school of DB design.

E.F. Codd
Monday, April 5, 2004

I am sure esteemed Dr. Codd is referring to my original post’s parent ;).

But as an aside, this is an important point to SQL products’ poor implementation of the relational model:
We (this is the ‘royal’ we) want a set of attribute values shared between one or more tables with no repeats.  This is almost the ‘base case’ for a relational domain.

The concrete example of a GUID datatype is sort of a ‘domain’ as the relational model specifies.  It has certain properties (length, character string, etc.), certain operations which can be performed on it (insert gives you a new ID, relational operators are defined since it inherits character string properties, etc.), range  defined via some sort of algorithm, etc.

If our DBMS does not support built-in GUIDs (quite likely) and if it does not support domains (probability nearing 1), we have to result to some sorcery in order to accomplish this.

I can think of a couple of all-SQL solutions, all of which will fit your requirements, none of which as simple and as elegant as a domain.

MR
Monday, April 5, 2004

Meta-data driven approach seems sexy but very complex: extracting one widget have me jumping all over the place in a big ass query? I just don't trust my sql skills to make that query performant.

I can't find a flaw in the approach itself though...just in it's suitability to a man of my modest abilities. ;->

I like the GUID idea too... SqlServer 2000 supports GUIDs, doesn't it? (via ROWGUIDCOL).

Of course, persisting objects or xml documents in the database would solve this problem as well. (That was my original design approach actually, and I may still go with that.

Does anyone else think that a trigger might be the way to go? It does seem like a simple way of locking down the data model, which is really all I'm looking for...

mystified
Monday, April 5, 2004

A simple possibility that will let you have your constraints:

table person (
id
name
age
)

table customer (
personId references person(id)
phone
state
)

table employee (
personId references person(id)
title
salary
)

Keep a reference in each of your sub-tables to the parent it inherits from.

Matt
Monday, April 5, 2004

Time for a diversion into ORM.

ORM (Object Role Modeling), lets you collect facts about data, define the constraints on those facts and when you're happy convert that conceptual model to first a logical ER equivalent and then to a physical database.

One of the concepts in ORM is Typing, Supertyping and subtyping.  You can treat it as a kind of venn diagram of populations of facts.  Those populations might be discrete sections or they might overlap in some areas.

Typically you control a super/subtype using some common field which is enough to differentiate them.  For populations where an instance of data could exist in more than one subtype at one time then you have collections of that common field, which implies some kind of masterlink record in an RDBMS.

For instance, in a current product we're working on there's a core of manipulating Contacts.  Contacts can be members of staff, managers, and in that role they can be sales people, managerial, florists, drivers and so on.  Contacts can also be customers, individuals at customers in a particular role, the responsible purchaser, the recipient of deliveries or entirely separate individuals.  Its conceivable that an individual Contact could have three or four roles in relation to the business and that a Contact might well be a customer at the same time as a member of staff, or that a contact at a customer, is themselves a customer individually.

This is simply accomplished using a single table of Contacts with a masterlink table of Contact and Roles.  That table simply has the two foreign keys.  In one column the key to Contacts, in the other the key to Roles.  Stored queries, or views then present the subsets of Contacts in particular roles as needed.

It would be entirely allowable though, to have separate tables of Contacts in different Roles if they could not overlap.

A Widget that was in a succession of states could be separated out that way, on the whole though I tend to keep things which are the same in one and only one place.

Simon Lucy
Monday, April 5, 2004

Knowing how best to model an inheritance hierarchy in a RDBMS require accepting certain tradeoffs which will depend upon factors such as how many instances of each class you have, the ratio of SELECTS to INSERTS to UPDATE, the depth of the inheritance hierarchy and the stability of the class structure.

Given four classes A, B, C, D in an inheritance hierarchy such that C is a B, B is a A, D is a A, there are three potential structures.

1. Aggregate all of the attributes in the subclasses 'up' into a single table containing all the attributes and a view per class that extracts just those attributes needed for each class. This relies there being no NOT NULL attributes in classes B, C and D. It is also convenient if your database allows inserts into the underlying database through the view.

2. A table per class and a view per class that joins the necessary class' base table to the view that represents that class' parent. i.e.

class A  --> table a  --> create view A as SELECT * from a
class B  --> table b  --> create view B as SELECT * from b, A
class C  --> table c  --> create view C as SELECT * from c, B

I use * here simply to mean all the attributes, you'd actually have to enumerate them (i.e. a.a1, b.b1, b.b2 etc.) to get them in the correct order. The WHERE clause can either use a natural key or a surrogate key (e.g. a GUID) as you like. ROWIDs are rather fast if you're using Oracle.

The reason that I use a join across the parent view rather than create view C as SELECT * from C, B, A is to avoid having alter the view for both C and D if the structure of B is altered. This can make maintenance easier and (so long as the query optimiser is any good) shouldn't matter at all as far as the database engine is concerned.

3. Aggregate all of the attributes in the superclasses 'down' in to a table for every concrete sub-class and create views for each concrete superclass that select UNION from the tables holding their sub-classes, i.e.

create view A as SELECT * from A UNION SELECT a1, a2 ... an from B UNION SELECT a1, a2 ... an from C UNION ALL SELECT a1, a2 ... an from D

You need these views because a query against all A's means not just just A, but also all sub-classes of A (because C is a B is a A).

As far as I know these are the only three possible ways of modelling a class hierarchy. They all have their advantages and disadvantages. (1) is simple, but I've yet to see a database with no NOT NULL attributes. (3) is good in that you only ever insert into a single table, but stinks if you have to refactor any of the superclasses as _every_ table needs to be changed. (2) gets round the 'fragile superclass' problem, but stinks if the inheritance hierarchy is deep because of the number of joins involved and, as very few databases seem to let you insert into complex views, because you need to insert into multiple tables. Oracle is rather better syntactically as you can associate an INSTEAD_OF trigger with INSERT on the view, but it still doesn't get round the join problem.

David Roper
Monday, April 5, 2004

Matt,
Your solution would allow a customer and an employee to refer back to the same person (one of the situations I'm trying to avoid).
I guess I should have specified that I'm not trying to model multiple inheritance...

mystified
Monday, April 5, 2004

I always get the impression this is trying to fit a round peg into a square hole.  Why use inheritence at all?

christopher baus (www.baus.net)
Monday, April 5, 2004

The best advice I can offer: go buy Martin Fowler's "Patterns of Enterprise Application Architecture". He covers this is GREAT detail, including alternatives with the trade-offs. Highly recommended.

Brad Wilson (dotnetguy.techieswithcats.com)
Monday, April 5, 2004

"Your solution would allow a customer and an employee to refer back to the same person"

Con you be absolutely positively 100% certain that an employee will never purchase anything? What if the company hires a former customer? What if a customer leaves and starts a consulting company using the products, thereby becoming a customer?

That kind of stuff always breaks your model.

  --Josh

JWA
Monday, April 5, 2004

This is a classic OO problem.  I can remember it being THE OO model used at Object Space back in the day. 

Anyways, use attributes instead of inheritence to model the employee.    The problem is inheritence -- square peg, round hole again.  It should be the model of last resort.

christopher baus (www.baus.net)
Tuesday, April 6, 2004

If all you need is to maintain an ordered list, inheritance seems a bit of overkill.

Add a field to each widget that holds its position in the list, and create an unique index on it. Yes, it's more hassle on the client side when moving things around (moving one widget may require updating positions of several others), but speaking of database integrity it's much safer, and easier to comprehend for your "green" programmers.

Egor
Tuesday, April 6, 2004

I get the impression from the OP's original description that he isn't necessarily modeling a situation where the "parent" types have data.  In that case, it might be worth conceptualizing as Oracle sequences shared between multiple tables.

I know we use Oracle sequences to implement per-table GUIDs more often than we use extra table/s to implement Oracle sequences, but either is reasonable.  I've found that sometimes the design feels 'cleaner' using the other mental model.

Mikayla
Tuesday, April 6, 2004

I still think this is putting the cart ahead of the horse, especially if there is no data members in the base.  This is real pet peeve of mine, but I have found time and time again that pushing object layers into the database fails. 

If the application is data driven, the design should be driven by the data schema, and not the run time object model.  Objects are a good model for in memory operations, but they don't persist, nor cross process boundaries easily. 

I think the biggest failure of the OO revolution was to push the concept of objects everywhere.  Even where they don't make sense (over the network) or when a well know solution existed (relational databases).  Using objects to model a data driven enterprise application is a bad approach. 

christopher baus (www.baus.net)
Tuesday, April 6, 2004

Create a "abstract" table that contains only object id's and their type.  Object id's are generated from a db sequence, so always unique.  The TYPE field holds the name of the "concrete" table where objects are actually stored (so you would have a FOO table, a BAR table, etc).  These tables do not have their own primary key, they just use the object id as foreign key.  So any id ("12345") appears once in the abstract table, and in exactly one concrete table.  This is enforced at the database level with triggers upon insertion and deletion.

Other than the id and possibly a name, the concrete tables could all have different fields.  Write whatever views you need to group the objects in various ways by common features.


create table object (
    oid        number(9) constraint pk_object_oid primary key,
    otype  varchar(30) not null
);
comment on table object is 'The root abstract class of all objects';
comment on column object.oid is 'Unique object identifier';
comment on column object.otype is 'Which table contains the concrete part of this object';


create table foo (
        oid            number(9)
        constraint pk_foo_oid primary key
        constraint fk_foo_oid references object(oid) on delete cascade,
        name    varchar(30) unique,
        created date not null,
        updated date
);

Biotech coder
Tuesday, April 6, 2004

have you looked at databases that support inheritence, such as postgresql??

They  all seem to do different things differently, but one might suit the type of inheritence you are trying to achieve.

Tapiwa
Wednesday, April 7, 2004

base the design on the task at hand....is there much reporting going on (data retrieval)?

apw
Wednesday, April 7, 2004

Thanks everybody! This board is really great.

I don't need to actually model inheritence per se (that is to say, there is no data in the base class). Sorry if using the "I" word gave the wrong impression.

My objects could never ever be of two types: trust me, it's definitely an either / or situation.

mystified
Wednesday, April 7, 2004

Some reporting is going on, by the way.

mystified
Wednesday, April 7, 2004

'Serialise' every class and store the objects that way. Similarly, have a validate method for each attribute which ensures that its relationships within the database are valid before you store it.

The hard work in this approach can be implemented high up the object hierarchy (i.e. once) and your greenhorns can implement the validations lower down.

You then need some mechanism to advertise the serialised object within the database in case something needs to do a referential integrity check against it.

The Stub
Wednesday, April 7, 2004

There are 3 things that are being confused on this board:
1) Native hierarchy queries in a database
2) Generalised constraints without foreign keys
3) Storing a tree in a database
Solve them 1 by 1.

1) Use PostgreSQL, Informix or DB2

2) Use the SQL 'CHECK' keyword. Shoot anyone who
    suggests triggers.

3) Either (a) store a damn path as a string, or (b) try to
    understand SQL genius Joe Celko's SQL tree scheme.

a) Store the path node IDs as slashes seperating fixed
    length words, so the length of the string is predictable
    and makes for easier substring operations. Use
    substring operations instead of hierarchy operations
    e.g. everything under node widgetA is everything with
    a left substring matching 'widgetA/'. Use a seperate
    field or two to store human readable pretty paths and
    node names, if the node names need to vary in length.

b) Google for Celko, SQL, tree

A
Thursday, April 8, 2004

I hate inheritance.

A
Thursday, April 8, 2004

*  Recent Topics

*  Fog Creek Home