Fog Creek Software
Discussion Board




Designing a numbering scheme for a knowledge base

Hi all, I'm the newly named, horribly underexperienced and unskilled librarian of the (doomed from birth) ReactOS project ( http://www.reactos.com/ ), and I'm about to start the nearly impossible task of enforcing some discipline in documentation writing on a heterogeneous group of developers who never met in real life - good thing I'm an optimist

The documentation project is still in its very early stages of design, but I've already decided to organize it into three Books/Manuals (User, Administrator and Developer), and a Knowledge Base kind of resource for everything (tutorials, FAQs, release notes, etc.) that doesn't really fit in a book

And it's the Knowledge Base I'm the most concerned about at the moment (since the ridiculously tiny amount of documentation we currently have consists almost exclusively of tutorials). In particular, I need a numbering scheme that guarantees unique identifiers that are easily recognizable as KB article ids even out of context (I liked Microsoft's Q<number>, but they have dropped it for reasons I don't really understand). I don't know, maybe using the date and hour of when the article was originally issued?

Suggestions? Ideas? (please, no ReactOS-related flames or questions on the forum - mail me instead)

KJK::Hyperion
Tuesday, December 10, 2002

Are you going to store these in a database?  Why not let it take care of that for you.  Just use an auto-increment or GUID field.  There's little to be gained by numbering them yourself.  All you really want to do is avoid collisions, right?  And if you want to prepend a 'Q' when you talk about them, knock yourself out.

Brian
Tuesday, December 10, 2002

Another possibility to look into is an MD5 checksum. An advantage / disadvantage is that it will change if the content changes. You have to look at your situation to decide which of those might matter to you as compared to a db sequence value or a GUID that have no connection to the content itself.

perl's got a nice one, Digest::MD5, that's pretty flexible in the type of output it provides.
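
As a minimal sketch of the checksum idea, here is the same thing in Python's standard hashlib module rather than Perl's Digest::MD5. The function name content_id and the sample article text are made up for illustration.

```python
import hashlib


def content_id(text: str) -> str:
    """Return the MD5 hex digest of the article text (32 hex characters)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


original = "How to run ReactOS in a virtual machine."
revised = original + " Updated for the latest release."

# The same text always gives the same id; any edit gives a different one.
print(content_id(original))
print(content_id(revised))
```

This shows both sides of the trade-off mentioned above: the id needs no central counter, but it does not survive an edit to the article.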

anonQAguy
Tuesday, December 10, 2002

> Are you going to store these in a database?

whoah! I always seem to miss the obvious

On second thought... a database seems the best solution, but that will create other problems. A technical problem I immediately see is versioning. We can do (and we currently do) that super-easily with text files in CVS, but with a database? And CVS allows people to easily download part of the whole thing to their disk, work on it, and upload the updated version - I'm not sure whether, or how, that can be done with a database. Finally, I don't think I'm in a position to impose yet another tool on the rest of the team (last time someone tried, he risked being lynched) :-/

> Why not let it take care of that for you. Just use an auto-
> increment or GUID field.

A GUID is overkill, and an auto-increment creates numbers with a variable width - ok, not the end of the world, but it doesn't look as good, and it isn't as immediately recognizable as a fixed-width number (actually, I was, more than anything else, wondering what meaning the Microsoft KB numbers have, if any)

BTW, I think I'll settle with Ayymmddhh (year-month-day-hour, "A" having an undefined meaning - "answer" or "article" or something like that). Auto-increment (as long as the Earth keeps spinning - but when it stops, the ReactOS Knowledge Base will be the last of my worries), unique, can be generated automatically, has a fixed width
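
A rough sketch of the Ayymmddhh format in Python, assuming the id is derived from the local issue time. The function name hour_id is hypothetical.

```python
from datetime import datetime


def hour_id(issued: datetime) -> str:
    """Format an issue time as 'A' plus two-digit year, month, day and hour."""
    return issued.strftime("A%y%m%d%H")


print(hour_id(datetime(2002, 12, 10, 14, 30)))  # A02121014

# Two articles issued within the same hour get the same id.
print(hour_id(datetime(2002, 12, 10, 14, 5)) == hour_id(datetime(2002, 12, 10, 14, 55)))  # True
```

The second call shows the limit of the scheme: it stays unique only as long as no more than one article is issued per hour.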

Another thing I wanted to ask, but forgot: single numbering scheme for all sub-products/projects, or a different prefix for each? Does the latter still make sense? (they told me it used to)

> And if you want to prepend a 'Q' when you talk about
> them, knock yourself out.

I don't think I understand the meaning of "to knock yourself out" in this context

KJK::Hyperion
Tuesday, December 10, 2002

<quote>
A GUID is overkill, and an auto-increment creates numbers with a variable width - ok, not the end of the world, but it doesn't look as good, and it isn't as immediately recognizable as a fixed-width number.
</quote>

Obviously, if you want a fixed width number and are using an auto-increment database column you would build a view to do the formatting for you.

Matthew Wills
Tuesday, December 10, 2002

Guys, don't overarchitect.

A view for a fixed width number? Whoa. First of all, you can just use formatting during e.g. SELECT (every database has its own formatting function). But a much simpler way is to number the first item 100,000 (or 171819 or any other number you fancy that leaves ~800,000 items within 6 digits). Instant width, and it looks better than item 000001 too, IMHO.
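
The "start at 100,000" trick can be sketched in a few lines of Python. The names FIRST_ID, LAST_ID and next_id are assumptions for illustration, and a real system would keep the last issued number in persistent storage.

```python
FIRST_ID = 100_000  # first article number; already six digits wide
LAST_ID = 999_999   # last number that still fits in six digits


def next_id(last_id=None):
    """Return the number after last_id, or FIRST_ID for the very first article."""
    if last_id is None:
        return FIRST_ID
    if last_id >= LAST_ID:
        raise OverflowError("six-digit id space exhausted")
    return last_id + 1


print(next_id())         # 100000
print(next_id(100_000))  # 100001
```

No padding or formatting step is needed, since every number in the range already has the same width.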

Or, you could go for something more revolutionary, and adopt a naming scheme that can be memorized better. E.g., using a system similar to Oren Tirosh's Mnemonic Encoding system [ http://www.tothink.com/mnemonic/ ], you can encode any 6 digit value using just two words from the list in [ http://www.tothink.com/mnemonic/wordlist.txt ]. Instead of labelling a knowledge base item "Q387123", "Q123907" and "Q901231" like Microsoft does, you'd have items with names like "crimson dolby", "ferrari jackson" and "maxwell lithium". Much easier to spell out, memorize, etc.

If any reader DOES want to follow this lead, be sure to make the number assignment unique but seemingly random (so that sequential items DO NOT get similar mnemonics), and also make it order independent (cuts down to ~600,000 items, but makes it easier still to memorize; and you can always add another word for more space). Both are easy to achieve, but I'll be glad to help if you plan to implement any and have trouble.
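
A toy sketch of those two properties in Python, not the actual Mnemonic Encoding system. The eight-word list is a hypothetical stand-in for the real word list, and the multiplier is an arbitrary choice; multiplying by a constant coprime to the number of pairs is only a crude scramble, and a real implementation would likely want a stronger permutation.

```python
import itertools
from math import gcd

# Hypothetical stand-in list; the real word list linked above is far longer.
WORDS = ["crimson", "dolby", "ferrari", "jackson",
         "maxwell", "lithium", "apollo", "banjo"]

# Every unordered pair of distinct words, so word order carries no information.
PAIRS = list(itertools.combinations(WORDS, 2))
N = len(PAIRS)  # 28 pairs for 8 words

# A multiplier coprime to N permutes 0..N-1, so consecutive article
# numbers do not land on consecutive pairs.
MULTIPLIER = 11
assert gcd(MULTIPLIER, N) == 1


def encode(number: int) -> str:
    """Map an article number in range(N) to a two-word mnemonic."""
    if not 0 <= number < N:
        raise ValueError("number out of range for this word list")
    first, second = PAIRS[(number * MULTIPLIER) % N]
    return f"{first} {second}"


# Reverse lookup keyed on the *set* of words, which makes decoding
# independent of the order in which the words are given.
DECODE = {frozenset(encode(n).split()): n for n in range(N)}


def decode(mnemonic: str) -> int:
    """Map a two-word mnemonic (in either order) back to its number."""
    return DECODE[frozenset(mnemonic.lower().split())]


print(encode(0), "|", encode(1), "|", encode(2))
print(decode(encode(5)))
```

Dropping word order is what shrinks the id space: n words give n*(n-1)/2 unordered pairs instead of n*n ordered ones.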

Ori Berger
Wednesday, December 11, 2002

..... and of course, I forgot to add, this is overkill as well  - people have come to expect items to have a hard-to-memorize number, so a numeric ID works relatively well in practice. But it could be communicated between humans much more effectively using mnemonics, and it is just as easy to retrieve by a computer according to a mnemonic as it is by numeric ID.

Ori Berger
Wednesday, December 11, 2002

I did some work on a knowledge base this year where the data was in the form of Q&A's, not exactly an FAQ.  The question would remain the same over time but because of the nature of the domain (rules of conduct of lawyers), answers could vary depending upon _when_ the question was asked.

The knowledge base is stored as a set of questions, related answers, dates of validity and notes on the answers.  On publishing, the _current_ question and answer, with notes, is published as a web page within the application space.

So you get to squeeze the concept of a browser into it and let them define internal and external links.

In the end it turned into a content management system.

I've done quite a bit over the years with time sensitive information that needed to be kept intact and in context, tax information is one of those kind of areas.

For regular versioning, where you're only interested in the last version stored for retrieval, but you want to keep all versions for tracking, auditing purposes, you could store the change record as the number of seconds elapsed this year (plus a signifier for the year).

That sorts versions without too much hassle and it's easy to generate.
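
One way to read that, as a hedged Python sketch: the year is the "signifier" and the seconds since 1 January are zero-padded so that plain string sorting matches time order. The function name version_stamp and the "year-seconds" layout are assumptions, and naive (timezone-free) timestamps are assumed.

```python
from datetime import datetime


def version_stamp(saved: datetime) -> str:
    """Return '<year>-<seconds elapsed since the start of that year>'."""
    start_of_year = datetime(saved.year, 1, 1)
    seconds = int((saved - start_of_year).total_seconds())
    # A year has at most 366 * 86400 = 31,622,400 seconds, so 8 digits suffice.
    return f"{saved.year}-{seconds:08d}"


stamps = [
    version_stamp(datetime(2002, 12, 11, 9, 0, 0)),
    version_stamp(datetime(2002, 1, 1, 0, 0, 1)),
    version_stamp(datetime(2003, 1, 1, 0, 0, 0)),
]

# Sorting the strings puts the versions in chronological order.
print(sorted(stamps))
```

Two changes saved within the same second would still get the same stamp.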

Simon Lucy
Wednesday, December 11, 2002

The basic principle of IDs is: don't attach ANY meaning to them. Please. The bank I'm with thinks it's clever that they can tell what local branch office you're at just by looking at your account number, but it royally annoys users that their account numbers change when they move to another city, even if they stay with the same bank.
Using an auto-number is a good idea, but don't use the internal autonumber your database table will be using as primary key. It's going to bite you in the ass if you want to split the database to different local servers, if your database ever grows too large and you want only recent articles in one database and older ones archived elsewhere, if you want to change your database layout, etc.
Seconds-elapsed numbering schemes are also going to lead you into problems if two new documents are submitted simultaneously. "But I'm the only one creating new documents!" If your system is successful, it won't stay that way for long.
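
A minimal sketch of an auto-number that is kept separate from the table's primary key, using Python's built-in sqlite3 module. The table and column names (articles, row_key, kb_number, kb_counter) are invented for illustration, and SQLite merely stands in for whatever database is actually used.

```python
import sqlite3


def open_kb():
    """Create an in-memory database with a public number separate from the key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE articles (
            row_key   INTEGER PRIMARY KEY,      -- internal key, never shown to readers
            kb_number INTEGER NOT NULL UNIQUE,  -- public article number
            title     TEXT NOT NULL
        );
        CREATE TABLE kb_counter (next_number INTEGER NOT NULL);
        INSERT INTO kb_counter (next_number) VALUES (1);
    """)
    return conn


def add_article(conn, title):
    """Insert an article and return the public number assigned to it."""
    with conn:  # one transaction: commit on success, roll back on error
        # Bump the counter first, then read it back inside the same transaction.
        conn.execute("UPDATE kb_counter SET next_number = next_number + 1")
        (next_number,) = conn.execute(
            "SELECT next_number FROM kb_counter").fetchone()
        number = next_number - 1
        conn.execute(
            "INSERT INTO articles (kb_number, title) VALUES (?, ?)",
            (number, title))
    return number


conn = open_kb()
print(add_article(conn, "Running ReactOS in a virtual machine"))  # 1
print(add_article(conn, "Release notes"))                         # 2
```

Because kb_number is handed out by a counter rather than a clock, two simultaneous submissions cannot get the same number, and the internal row_key can change (say, after splitting or archiving the table) without touching the published ids.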

I read you're concerned with fixed width; you could pad your numbers or start at 100000 as Ori writes. But what if you ever have more than 900'000 articles? Keep your life simple, go for variable width. Are there any reasons, besides aesthetic ones, that you want fixed width?

HTH

Yves

Yves
Wednesday, December 11, 2002

What about just using FogBUGZ?  Sounds like your situation is similar to mine.  We are testing it out here, and it looks pretty good.  The drawback is if you want the knowledge base to be public (on the web).  For a private knowledge base, I don't think you need to build your own.

I know this doesn't answer your question, but I thought I should mention it.

Scott Stonehouse
Wednesday, December 11, 2002

I'm all for using a unique number to name items in the DB.

As for prepending "Q", well I think that either needs to have some meaning for you or you need to forget about it.

Why not use the prepended letter as a coded guide to the type of article - it won't hurt anyone who doesn't know the scheme because typing "Q123456" or "A123456" doesn't make any odds to them, but it might help advanced users.

For example
Q - "standard" knowledgebase bug / issue reports with work arounds
S - Security issues
H - How to's
P - Change notes for product patches
etc...
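
A small Python sketch of how such a prefix could be decoded. The category table mirrors the example list above; the six-digit width, the function name parse_article_id and the "unknown" fallback are assumptions.

```python
import re

# Prefix letters from the example list above.
CATEGORIES = {
    "Q": "bug / issue report with workaround",
    "S": "security issue",
    "H": "how-to",
    "P": "patch change notes",
}

# One letter followed by exactly six ASCII digits (the width is an assumption).
ID_PATTERN = re.compile(r"^([A-Z])([0-9]{6})$")


def parse_article_id(article_id: str):
    """Split an id such as 'S123456' into (category, number)."""
    match = ID_PATTERN.match(article_id.strip().upper())
    if match is None:
        raise ValueError(f"not a valid article id: {article_id!r}")
    prefix, digits = match.groups()
    return CATEGORIES.get(prefix, "unknown"), int(digits)


print(parse_article_id("S123456"))  # ('security issue', 123456)
print(parse_article_id("A123456"))  # ('unknown', 123456)
```

A reader who does not know the scheme still gets the right article, since an unrecognised letter simply falls back to "unknown".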

With just a bit of clever web design, it would be easy to allow someone to visit a dynamic page that shows the last 10 new articles in each category, or to create pages that  allow users to easily review lists of all patch change notes, or security bulletins, scan a list of "how to" articles, etc.

Rob

Robert Moir
Wednesday, December 11, 2002

> Instead of labelling a knowledge base item "Q387123",
> "Q123907" and "Q901231" like Microsoft does, you'd have
> items with names like "crimson dolby", "ferrari jackson" and
> "maxwell lithium". Much easier to spell out, memorize, etc.

So, when someone asks "how do I run ReactOS in VMWare?" I tell them "see KB article "crimson lithium""? no, really. These identifiers aren't for humans, I don't want to force people to remember them (I see the possible applications, though - and I remember that PGP used a similar alternative representation for key fingerprints)

And I eventually want something like an URL scheme for KB articles (let's say kb://), and the numeric id works much better for this

KJK::Hyperion
Wednesday, December 11, 2002

> What about just using FogBUGZ? Sounds like your
> situation is similar to mine. We are testing it out here,
> and it looks pretty good.  The drawback is if you want the
> knowledge base to be public (on the web). For a private
> knowledge base, I don't think you need to build your
> own.

Like I already said, I'm not some sort of project manager. It's an open source project, we're all peers, so I can't just decide a solution and force everyone else to adopt it (I'm not very popular with the rest of the team either)

We have already chosen both the bug tracking *and* the content management system (respectively, Scarab from Tigris and ezPublish), so I have my hands bound here. I have to design something that integrates into ezPublish (and this means XML - but we've already chosen DocBook XML for documentation), and it doesn't need to be a full-fledged bug tracking system - just a repository for short HOWTOs (those that outgrow the KB will be made into full Guides and integrated in the appropriate book) and release notes, like Microsoft's

Finally, FogBUGZ, AFAIR, runs on ASP and requires Microsoft SQL Server - our servers currently run Linux, and even when ReactOS is able to take over the task, it'll still be a long way until IIS or MSSQL run on it (and we'd still need a Windows server license, IIRC)

KJK::Hyperion
Wednesday, December 11, 2002

> Why not use the prepended letter as a coded guide to the
> type of article - it won't hurt anyone who doesn't know the
> scheme because typing "Q123456" or "A123456" doesn't
> make any odds to them, but it might help advanced users.

I'd rather implement this with keywords - but I'll think about it anyway

KJK::Hyperion
Wednesday, December 11, 2002

> I don't think I understand the meaning of "to knock yourself out" in this context

My point was that part of what you are asking about is simply presentation.  Your article ids can simply be integers.  If you think it is better to refer to them as 'Q' followed by 6 digits, you can present them this way, and just have the interface translate (i.e. drop the 'Q').  Maybe that's not great, and can lead to confusion (about what the 'real' id is), but at least for the fixed width number, it's a no brainer (when casting to an int, it doesn't matter how many zeroes are in front (unless it changes it to an octal number)).
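
To make the presentation point concrete, a minimal Python sketch in which the stored id is a plain integer and the 'Q' plus zero padding exists only at the interface. The function names display_id and parse_id are hypothetical.

```python
def display_id(number: int, width: int = 6) -> str:
    """Present a stored integer id as 'Q' followed by zero-padded digits."""
    return f"Q{number:0{width}d}"


def parse_id(text: str) -> int:
    """Accept 'Q000123', 'q123' or '000123' and return the stored integer."""
    text = text.strip()
    if text[:1].upper() == "Q":
        text = text[1:]
    # An explicit base of 10 means leading zeroes are never read as octal.
    return int(text, 10)


print(display_id(123))      # Q000123
print(parse_id("Q000123"))  # 123
print(parse_id("000123"))   # 123
```

The database only ever sees the integer, so the prefix and the width can change later without renumbering anything.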

The idea of having a hash of the article be its id had occurred to me, but then you run into trouble with versioning and preserving links when an article is updated.  You could use it to produce the first and permanent id for the article (and its later versions), and not attach meaning to it beyond that point.
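
A rough Python sketch of that compromise: the id is computed once from the first version and then kept for every later revision. The in-memory dictionary, the function names publish and revise, and the truncation to 8 hex characters are all assumptions; truncating makes collisions more likely, so the sketch refuses to reuse an id.

```python
import hashlib

# Hypothetical in-memory store: permanent id -> list of versions, oldest first.
articles = {}


def publish(first_version: str) -> str:
    """Derive a permanent id from the first version and store that version."""
    article_id = hashlib.md5(first_version.encode("utf-8")).hexdigest()[:8]
    if article_id in articles:
        raise ValueError(f"id {article_id} is already taken")
    articles[article_id] = [first_version]
    return article_id


def revise(article_id: str, new_version: str) -> None:
    """Add a new version under the same id; the id is never recomputed."""
    articles[article_id].append(new_version)


kb_id = publish("How to run ReactOS in a virtual machine.")
revise(kb_id, "How to run ReactOS in a virtual machine (updated).")
print(kb_id, len(articles[kb_id]))
```

Links that quote the id keep working after an update, because the hash is only a way of picking the number, not a checksum of the current text.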

Brian
Wednesday, December 11, 2002

"I'd rather implement this with keywords - but I'll think about it anyway "

Fair enough, but you say you like the idea of having a letter like "Q" in front of your KB article numbers. I was saying there's not much point if it doesn't mean anything; it's more data to store, more for people to type, etc.

You talk about wanting to prepend a letter to make it look nice, and having a fixed width for consistency, but I can't help thinking that if you don't know the basics of how to index these articles then worrying about how it's going to look on screen is very much putting the cart before the horse.

Robert Moir
Thursday, December 12, 2002

> Fair enough, but you say you like the idea of having a
> letter like "Q" in front of your KB article numbers. I was
> saying there's not much point if it doesn't mean anything,
> it's more data to store, more for people to type, etc.

I think I'll eventually stick with just a number. Faster to index, and to make a resource identifier universal... well, you make it a Universal Resource Identifier (URI), of course. kb://<number> should do
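
A short sketch of reading and writing kb://<number> identifiers with Python's standard urllib.parse. The kb scheme is the hypothetical one proposed here, not a registered scheme, and the function names are made up.

```python
from urllib.parse import urlsplit


def make_kb_uri(number: int) -> str:
    """Build a kb:// identifier from an article number."""
    return f"kb://{number}"


def parse_kb_uri(uri: str) -> int:
    """Extract the article number from a kb://<number> identifier."""
    parts = urlsplit(uri)
    # For 'kb://123456' the number ends up in the network-location part.
    if parts.scheme != "kb" or not parts.netloc.isdecimal() or parts.path:
        raise ValueError(f"not a kb:// article identifier: {uri!r}")
    return int(parts.netloc)


print(make_kb_uri(123456))           # kb://123456
print(parse_kb_uri("kb://123456"))   # 123456
```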

> You talk about wanting to prepend a letter to make it
> look nice, and having a fixed width for consistency, but I
> can't help thinking that if you don't know the basics of
> how to index these articles

Exactly. Indexing, and especially storage in indexed form, and automatic conversion from XML to database, are going to be the hard part. I'm going to discuss it with the project coordinator

KJK::Hyperion
Thursday, December 12, 2002
