Fog Creek Software
Discussion Board




How are Usenet newsgroup posts stored?

Hi:

I was wondering how Usenet posts are stored - are these on relational databases?

Or are they text files?

Interprid
Sunday, August 24, 2003

I would think that how the files are stored will vary with how the newsgroup server is implemented.

I mean, at one level, the NNTP server just has to exchange messages in the correct format. However, each newsgroup server is going to store its data however the designer or developer decided. Just like each accounting package for the PC has a different format for its files. Or each vendor of a database server, from Pick to Sybase to MySQL, has its own format.

Without question, for any kind of decent performance you need the message id to be indexed, but after that, the actual internal storage is going to vary from server to server depending on the version of the software they use for their newsgroup (NNTP) server.

I would bet that some NNTP servers actually use a database for the messages. It is really going to depend on the whims of the developer.

Windows 2000 server does have the NNTP services installed and running by default, but I have never bothered to actually look at the format of the files used. Since Exchange server used JET, that is a possibility.

However, while most of the data is text, there is the need for indexing some of the stuff, so it is more than just a plain text file. There certainly could be ISAM files added in addition to the text file, but I am not really sure.


Albert D. Kallal
Edmonton, Alberta Canada
kallal@msn.com
http://www.attcanada.net/~kallal.msn

Albert D. Kallal
Monday, August 25, 2003

"Windows 2000 server does have the NNTP services installed and running by default,"

Not to drag security into things here, but...

WHY THE HELL IS THAT TURNED ON BY DEFAULT???

Ted
Monday, August 25, 2003

>>Not to drag security into things here, but...

>>WHY THE HELL IS THAT TURNED ON BY DEFAULT???

I totally agree. It is possible that I am mistaken on this one, and the admin who set up the server just enabled it because he did not know what NNTP stands for!

I had to ask why it was turned on, and then they disabled those services!

I don’t really know if it is a security threat, say more than enabling a web server, or email services. I mean, when you set up a server, there is a TON of services you get. It is possible that web services were enabled, and you get NNTP WITH that setting. In fact, you get FTP services, NNTP, and a few others for mail enabled.

I mean, if you are connecting your server to the net, you better have someone who knows what the heck they are doing anyway.

If any of the “personal” editions of Windows installed an NNTP server by default, that would be stupid, and is most certainly not the case.

For the server editions of Windows, we are talking about a very different situation. It is likely that NNTP along with stuff like FTP is enabled when you install the web server stuff.

I could well be corrected on this issue. However, I am not sure it is so much of a security issue. I mean, why expose a server to the net, and THEN WORRY about stuff like NNTP services? Authentication etc. is still going to be needed to use those services.



Albert D. Kallal
Edmonton, Alberta Canada
kallal@msn.com
http://www.attcanada.net/~kallal.msn

Albert D. Kallal
Monday, August 25, 2003

NNTP is a subcomponent of the IIS installation but has to be selected separately to install it.

SMTP is also a subcomponent but more irritatingly is a default.

Simon Lucy
Monday, August 25, 2003

As someone who has spent over a year working on a SQL-based NNTP system, I can say that while SQL is wonderful in many ways, it's a serious pain as a choice for NNTP.

For instance, in the real world, the variants of SQL often struggle with weird characters (/, ", ' etc). And Usenet posters LOVE to fill their subjects with garbage. Of course one can write defensively around this, but it's still a pain.
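One common way to "write defensively" here is to pass subjects as bound parameters rather than building SQL strings by hand. The following is only a minimal sketch using Python's built-in sqlite3 module; the table name, column names and sample subject are made up for illustration and are not taken from the system described above.

```python
import sqlite3

# In-memory database; the "subjects" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (article_no INTEGER PRIMARY KEY, subject TEXT)")

# A subject line full of the characters that break hand-built SQL strings.
nasty = """Re: "free" stuff / it's 100% 'real'; DROP TABLE subjects; --"""

# The ? placeholders let the driver handle quoting, so the subject is
# stored as plain data and is never interpreted as SQL.
conn.execute("INSERT INTO subjects (article_no, subject) VALUES (?, ?)", (1, nasty))

stored = conn.execute(
    "SELECT subject FROM subjects WHERE article_no = ?", (1,)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM subjects").fetchone()[0]
```

The subject comes back byte for byte as it went in, and the table is still there afterwards.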

Then, most variants of SQL really struggle with freeform text searching. For instance, if you are looking for the string "*java*" in all the subjects, SQL really bombs when it comes to doing a LIKE "%java%", because it can't use an index for that. An ordinary index only helps with matches anchored at the start of the string. So when you have millions of records, prepare for a big resource hit on searches. A related problem is that your indices are as big as, if not bigger than, your actual tables.
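To see why a prefix match can use an ordered index while a "%java%" style match cannot, here is a rough sketch in plain Python. A sorted list stands in for the index; this is an illustration of the idea, not how any particular database engine implements it, and the sample subjects are invented.

```python
import bisect

# A sorted list plays the role of an ordered index on the subject column.
subjects = sorted([
    "java generics question",
    "javascript closures",
    "learning java the hard way",
    "perl vs python",
    "why java is slow",
])

def prefix_search(sorted_keys, prefix):
    """Like LIKE 'java%': two binary searches bound the matching range."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" is used as an upper bound on any character following the prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

def substring_search(keys, needle):
    """Like LIKE '%java%': sort order is no help, so every key is examined."""
    return [k for k in keys if needle in k]

starts_with_java = prefix_search(subjects, "java")
contains_java = substring_search(subjects, "java")
```

The prefix search touches only a handful of entries no matter how long the list is; the substring search has to look at every one, which is the "big resource hit" once there are millions of rows.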

I should point out that I only capture the subjects, not the contents of messages, although presumably the same problems would occur. They may in fact be worse, because of attachment handling.

Still, given these issues, having the data accessible in SQL form does make it very easy to develop applications around the data. Particularly when it comes to writing applications in a variety of languages to access/manipulate this data.

However, for the genuine NNTP servers as you can find at your ISP, they most definitely do not use SQL.

For some details on a common news server (INN), see http://www.mibsoftware.com/userkt/inn/0036.htm

Chris Welsh
Monday, August 25, 2003

The "traditional" storage (used in Cnews and INN) is as follows:

Each element of the newsgroup name corresponds to a directory in the filesystem, underneath the spool directory. e.g. comp.os.linux maps to

/usr/spool/news/comp/os/linux.

Each message resides in a file whose name is the article number, which is specific to the news server.

There is a "history" file which is a database mapping Message-IDs to (newsgroup,article number) pairs. This database is traditionally maintained using a variant of Berkeley DB.
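The layout described so far can be sketched roughly as follows. This is an illustration only: a plain Python dict stands in for the history database, the function names are invented, and the Message-ID and article numbers in the usage lines are made up.

```python
import posixpath

SPOOL_DIR = "/usr/spool/news"  # spool directory from the example above

def article_path(newsgroup, article_number, spool_dir=SPOOL_DIR):
    """Map a newsgroup name and a server-local article number to a file path."""
    # Each dot-separated element of the group name becomes a directory.
    return posixpath.join(spool_dir, *newsgroup.split("."), str(article_number))

# Stand-in for the history database:
# Message-ID -> list of (newsgroup, article number) pairs.
history = {}

def record_article(message_id, newsgroup, article_number):
    """Remember where an article with this Message-ID was filed."""
    history.setdefault(message_id, []).append((newsgroup, article_number))

def lookup(message_id):
    """Return the spool paths for a Message-ID, or [] if it is unknown."""
    return [article_path(group, num) for group, num in history.get(message_id, [])]

# Hypothetical crossposted article filed under two groups.
record_article("<example-1@host.invalid>", "comp.os.linux", 1234)
record_article("<example-1@host.invalid>", "comp.os.misc", 77)
paths = lookup("<example-1@host.invalid>")
```

Note that the article number is per server and per group, so the same Message-ID maps to different paths on different servers.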

This layout was reasonable for news servers up until a few years ago. The main performance bottlenecks are:

1. Some Unix systems bog down on directories with thousands of files in them.  Newer systems that use B-tree directories fare better.
2. A significant amount of time is spent doing "expire" - removing old articles from the spool. The random-access nature of this activity coupled with the access to the history file was particularly bad for performance.

I have no idea what newer systems do.  Application-specific file systems can easily address these problems.

David Jones
Monday, August 25, 2003

"I totally agree. It is possible that I am mistaken on this one, and the admin who set up the server just enabled it because he did not know what NNTP stands for!

I had to ask why is it turned on, and then they disabled those services!"

You are mistaken about the default install - the 1d10t admin was probably told to "install IIS" and just drooled on the keyboard until the right keys shorted out to make up "Select All" in the options.

You are right on an "it depends" basis about how IIS stores posts:

The default NNTP service in IIS uses directories, I think, similar to the Unix method outlined.

If you have Exchange then of course Exchange takes over because it's funny like that and creates a public folder hierarchy that replicates the newsgroup hierarchy and uses that. And public folders are of course stored as part of the information store in a JET database.

Robert Moir
Monday, August 25, 2003

Is Usenet relevant these days? Seems like it bit the dust several years ago.

pb
Monday, August 25, 2003

Nah that's not really true... in some ways it is unfortunate that it is being replaced by Yahoo Groups and such for your casual AOL-type user.  But it is still used by Microsoft and a lot of companies for support for relatively technical users.

I guess the problem is that most people won't bother to enter the NNTP settings in their Outlook Express settings.  A lot of people just use web mail to avoid this hassle.  Another problem is all the spam in the newsgroups.

Andy
Monday, August 25, 2003

Yes, Usenet is still relevant.  It's not as big of a force as it once was, but there are plenty of vibrant communities still on Usenet.

The Pedant, Brent P. Newhall
Monday, August 25, 2003

Most NNTP servers still work the way they always have. Going to a relational database would not necessarily be a good thing, seeing as a normal ISP setup for an NNTP server involves about half a terabyte of storage at least. Either that or dropping the alt.binaries.* hierarchy (which is mostly porn; incidentally it's also the most popular subset of any ISP news server).

This is similar to e-mail: I've yet to meet an ISP that used an RDBMS to store customer mail.

Now, metadata about NNTP messages being stored in an RDBMS makes a lot of sense. There's still the problem of purging the expired messages. My ISP keeps about 6 to 9 months of messages in most of the newsgroups I read, and I'd expect 3 to 5 days in the porn, mp3 and warez binary groups. They have a set of rules which expire messages based on age, size, average size, etc.

Alex
Monday, August 25, 2003

Another serious problem with USENET is the complete lack of moderation and susceptibility to automated attacks.

The tech also hasn't evolved in a decade.  The tools used for moderating newsgroups pale compared to the moderation tools available on webforums.  Most companies I know of that used to run news servers have moved to web forums.  Installshield is one of them.

Strangely enough, Microsoft hasn't.  I wonder why that is really, because I figure they'd be one of the first to put together a Passport-using .NET application with built-in links to Outlook and IM services....

Alex
Monday, August 25, 2003

INN is the most widely used Usenet server software out there.

The latest version of INN provides a variety of databases, none of which are a traditional RDBMS. The operations a news server needs aren't really well adapted to a DBMS. The options I'm familiar with are TRADSPOOL and CYCBUF for the messages themselves, and periodically regenerated flat files for the indexes.

TRADSPOOL is the traditional one-file-per-message spool used by Usenet since time immemorial (well, 1981 or so).

CYCBUF is interesting. Each buffer is a fixed size file, with a pointer, and messages are written sequentially after the pointer (similar to a pure log file system). It's a circular buffer, so when it gets to the end of the file it wraps around and starts overwriting old messages. To keep the alt.binaries.warez.movies.porn.etc groups from filling the buffer, you can allocate different buffers to different sized messages or different group hierarchies.
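The circular buffer idea can be sketched in a few lines. This is a simplified in-memory illustration of the scheme as described above, not INN's actual on-disk format; the class name, the index dict and the wrap-on-overflow rule are assumptions made for the example.

```python
class CyclicBuffer:
    """Fixed-size buffer; new messages overwrite the oldest ones on wrap."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)  # stands in for the fixed-size file
        self.pointer = 0             # where the next message will be written
        self.index = {}              # message_id -> (offset, length)

    def store(self, message_id, body):
        if len(body) > self.size:
            raise ValueError("message larger than buffer")
        # Assumption: a message that does not fit before the end of the
        # buffer starts again at offset 0 rather than being split.
        if self.pointer + len(body) > self.size:
            self.pointer = 0
        start, end = self.pointer, self.pointer + len(body)
        # Forget any older message whose bytes are about to be overwritten.
        self.index = {
            mid: (off, length)
            for mid, (off, length) in self.index.items()
            if off + length <= start or off >= end
        }
        self.data[start:end] = body
        self.index[message_id] = (start, len(body))
        self.pointer = end

    def fetch(self, message_id):
        entry = self.index.get(message_id)
        if entry is None:
            return None  # never stored, or already overwritten
        off, length = entry
        return bytes(self.data[off:off + length])

# Tiny 10-byte buffer so the wrap-around is easy to see.
buf = CyclicBuffer(10)
buf.store("a", b"aaaa")
buf.store("b", b"bbbb")
buf.store("c", b"cccc")  # does not fit at offset 8, wraps and overwrites "a"
```

There is no separate expire pass in this scheme: old articles simply disappear when the pointer comes back around. Giving large binaries their own buffer, as described above, would just mean creating a second CyclicBuffer and routing messages by size or group.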

There are a couple of other strategies but I'm not familiar with them.

Peter da Silva
Tuesday, August 26, 2003

As for newsgroups versus webforums... the advantage to webforums is that the operator of the forum is a single point of control. The disadvantages are that the operator is a single point of failure, there's no effective offline mode, and there's no alternative to the web interface the server provides.

Effective moderation and central control of newsgroups is certainly possible. Several years ago, when the abuse level on Usenet was particularly bad, I implemented a series of experimental newsgroup hierarchies with an eye to providing a saner alternative to traditional Usenet. As it turned out the new "Usenet II" wasn't needed: the worst of the abuse was able to be controlled by filters and redlists, and possibly by the growing attraction of website vandalism as an alternative outlet for what some might term "youthful hijinks", but some of the earlier experiments have continued quite successfully.

The Usenet technology itself is also widely used in "disconnected" mode by many companies, who set up standalone news servers isolated from Usenet proper. Sometimes these (like the microsoft.* groups) later become distributed alongside regular Usenet groups.

Peter da Silva
Tuesday, August 26, 2003
