Fog Creek Software
Discussion Board




Webserver logfile analysis

Hi! I want to write a program that parses and analyzes webserver logfiles. Something like Analog or Webtrends. How would you do this? And how "is it done"?

Would you parse the file with grep? Would you consider it possible (not too time consuming) to parse the file into a database and get the analysis from simple SQL queries?

Speed is not so important, I just want to know how it is done.

Jeff
Thursday, May 29, 2003

If you write it, make sure it's a good one, not "yet another cryptic command-line tool that only works after 10 hours of configuration and reading cryptic references".

Being technically good is not enough. You have to have an excellent user interface!

John Krane
Thursday, May 29, 2003

Depending on your log format, you could easily come up with a simple grammar, like this

AllLog => { Log }*  //Zero or more Log
Log => TimeToken SrcNameToken DestToken FileNameToken '\n'

Then, create a scanner and parser with Flex and Bison.  Once the parse tree is generated, you can do whatever you want with it.  With enough Flex and Bison experience, it could be done in less than a day.  The hard part is coming up with an unambiguous grammar to avoid trouble with Bison.  Other tools, such as Accent, can handle even ambiguous grammars.

The problem here is that it is specific to one format.  So, as usual, there are always other solutions, perhaps better ones, using different tools.
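For a fixed layout, a rough sketch of the same tokenizing can be done without Flex or Bison at all; here is one in Python, where the four fields and the space-delimited format are just assumptions matching the token names above:

import re

# One regular expression standing in for the "Log" rule above.  The four
# fields (time, source, destination, file name) and the space-delimited
# layout are assumptions, not any real server's format.
LOG_LINE = re.compile(
    r'^(?P<time>\S+)\s+'        # TimeToken
    r'(?P<src>\S+)\s+'          # SrcNameToken
    r'(?P<dest>\S+)\s+'         # DestToken
    r'(?P<filename>\S+)\s*$'    # FileNameToken
)

def parse_log(path):
    """Yield one dict per line that matches the rule; skip everything else."""
    with open(path) as f:
        for line in f:
            m = LOG_LINE.match(line)
            if m:
                yield m.groupdict()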

Good Luck

RM
Thursday, May 29, 2003

Just read the source to Analog or, if you don't quite like C, something simpler like PySiteStats (http://www.diveintomark.org/projects/pysitestats/).

That'll tell you exactly "how it is being done". Which, on the other hand, isn't a terribly interesting question. The log file format is very simple to parse, and its logical structure is also simple. The interesting question is "how to get useful information from a web server log".

Krzysztof Kowalczyk
Thursday, May 29, 2003

If you're in the java world, do check out javacc for an excellent parser compiler: http://www.experimentalstuff.com/Technologies/JavaCC/

Chas
Thursday, May 29, 2003

Seems like there's VERY little lexing and parsing required for a log file. Most are simple, delimited file formats. No need to bring in the heavy duty yaccs 'n' bisons for web logs.

I'm using Webtrends because all the alternatives required too much painful configuration. Webtrends autodetects logfile formats from the major servers, which is a relief.

Joel Spolsky
Thursday, May 29, 2003

The complex thing with log files is getting things like path through site and how long someone's been on. Combining with cookies for more detailed reports is even more complex & requires a server side component.
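For example, a rough sketch of the "path through site" part in Python, assuming the hits have already been parsed into (ip, timestamp, url) tuples and treating a gap of more than 30 minutes as a new visit (both the field names and the cutoff are assumptions):

from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)   # assumed cutoff for "same visit"

def sessions(hits):
    """hits: iterable of (ip, timestamp, url) with datetime timestamps.
    Returns a list of (ip, [(timestamp, url), ...]) sessions."""
    by_visitor = defaultdict(list)
    for ip, ts, url in hits:
        by_visitor[ip].append((ts, url))

    result = []
    for ip, visits in by_visitor.items():
        visits.sort()                          # each visitor's hits in time order
        current = [visits[0]]
        for prev, cur in zip(visits, visits[1:]):
            if cur[0] - prev[0] > SESSION_GAP:
                result.append((ip, current))   # long gap: close this visit
                current = []
            current.append(cur)
        result.append((ip, current))
    return result

For one (ip, pages) session, the path through the site is [url for _, url in pages] and the time on site is pages[-1][0] - pages[0][0].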

I like Analog but use Weblog Expert Lite because it does what I need and requires zero configuration. I could get more detailed stats, but easier is easier.

www.MarkTAW.com
Thursday, May 29, 2003

I've not used the following tool from MS, but if you're looking at IIS (or Windows) server logs, then I suggest taking a look at the Log Parser from MS. This tool lets you run SQL statements over a number of different log file formats, including the Windows event logs and, as mentioned above, the IIS log files.

http://www.microsoft.com/windows2000/downloads/tools/logparser/default.asp

ko
Thursday, May 29, 2003

I have done something like this - as a Perl CGI script.
(There are tons like this if you search for Perl CGI.)

http://www.michaelmoser.org/auxloga/auxloganalyser.htm

Michael Moser
Friday, May 30, 2003

AWStats is free. The source is freely available, and it would be very difficult to write a better one.

Kent Design4Effect
Friday, May 30, 2003

AWStats looks nice. http://awstats.sourceforge.net/

www.MarkTAW.com
Friday, May 30, 2003

There is always something that you would like on top of AWStats - but it would be easier to do that as a standalone script (if you don't mind looking at AWStats).

For example, my stuff is like a bunch of directories, and I would like to have separate access reports per directory.

Michael Moser
Friday, May 30, 2003

I've done this before.

It's quite easy, actually; you just read in each line of the log file, and zip along the string looking for delimiters.  Once you find the fields you're looking for (IP address, page viewed, etc.), add them to an array or hash or map or whatever makes sense.  Personally, I kept a hash of hit counts for each page, where the keys were the names of the pages.  So, $page_hits{'/'} would store the number of hits for the root page.

Once you've read through the log file, print out the contents of the arrays/hashes/maps/whatever.
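A minimal sketch of that approach in Python, assuming Apache's common log format and a file named access.log (both assumptions):

# Common log format lines look like:
#   127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
page_hits = {}

with open("access.log") as log:             # file name is an assumption
    for line in log:
        try:
            request = line.split('"')[1]    # GET /index.html HTTP/1.0
            page = request.split()[1]       # /index.html
        except IndexError:
            continue                        # malformed line, skip it
        page_hits[page] = page_hits.get(page, 0) + 1

# Print the counts, busiest pages first.
for page, hits in sorted(page_hits.items(), key=lambda kv: kv[1], reverse=True):
    print(hits, page)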

Brent P. Newhall
Friday, May 30, 2003

"Being technically good is not enough. You have to have an excellent user interface!"

Why, so lackey-ass Windows admins can run web analyzer software like their knowledgeable Unix admin cousins?
        

Mike
Friday, May 30, 2003

Google "webalizer"

Somebody already wrote it so you wouldn't have to.

Mike
Friday, May 30, 2003

Hi again, and thanks so far for all the information. I know it has been done before, but I want to do it myself, more as an exercise and because I sometimes look for very specific information in my logfiles.

As I see it, most tools work like Brent already wrote: "Just read in each line of the log file, and zip along the string looking for delimiters. Once you find the fields you're looking for (IP address, page viewed, etc.), add them to an array or hash or map or whatever makes sense."

In case I need some additional information, would it make sense to parse the whole file again? Or is it better to transfer the whole file into a database and run SQL queries? I want to be as flexible as possible, but I also want to be able to work with huge logfiles.

Again, thank you very much!

Jeff
Friday, May 30, 2003

I haven't tried this yet, but saw it yesterday.  This article describes parsing logs into an OLAP cube for analysis.  Hope it helps.

http://www.sqlservercentral.com/columnists/gciubuc/importingandanalyzingeventlogs.asp

shiggins
Friday, May 30, 2003

I like the database idea because of the flexibility in querying. I thought that was the biggest problem with AWStats.

It's the way to go, and if you are considering a database, consider MySQL. It's free, the source is free, and benchmark tests show it is faster than everything except Oracle, which it ties.

Kent Design4Effect
Friday, May 30, 2003

Most log analyzers, I believe Webtrends also, don't actually store the log files in the database; they parse them and extract information from them. They then store THAT in the database.

If your site gets a ton of hits, the log files could add up very quickly.

You don't need every line and every request. Once you've parsed the files, you can decide what you need - page hits, IP addresses, paths through the site - and just store that.
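A rough sketch of that "store only the extracted numbers" idea with Python and SQLite (the table layout and the daily page-hit granularity are assumptions; any database would do):

import sqlite3

# Keep only the extracted numbers (page hits per day), not the raw log lines.
conn = sqlite3.connect("stats.db")
conn.execute("""CREATE TABLE IF NOT EXISTS page_hits (
                    day  TEXT,
                    page TEXT,
                    hits INTEGER,
                    PRIMARY KEY (day, page))""")

def record(day, page):
    """Bump the counter for one parsed hit, e.g. record('2003-05-30', '/index.html')."""
    updated = conn.execute(
        "UPDATE page_hits SET hits = hits + 1 WHERE day = ? AND page = ?",
        (day, page))
    if updated.rowcount == 0:
        conn.execute("INSERT INTO page_hits VALUES (?, ?, 1)", (day, page))

# ... call record() for every hit while parsing, then conn.commit() ...

# New questions become plain SQL over the summarized data:
for day, total in conn.execute(
        "SELECT day, SUM(hits) FROM page_hits GROUP BY day ORDER BY day"):
    print(day, total)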

You may lose the ability to run new queries, but you can keep your old log files around on your shiny new 360gig HD if you want to spend a couple hundred dollars...

Or, if like me your site doesn't get that many hits, you can keep your log files on your local hard drive & server.

www.MarkTAW.com
Friday, May 30, 2003

Jeff asked what to do in case he needs additional information.  IMHO, in that case, you just edit the source code to add the appropriate functionality and re-run the program.  What sort of extra functionality will you need?

I find that I learned much more by manually coding these things than by letting SQL do it for me.  It wasn't particularly difficult, and made for a good mental exercise.

Brent P. Newhall
Friday, May 30, 2003
