Fog Creek Software
Discussion Board

Stupid regular expression tricks

I know this is simple, but I've been frustrated all morning trying to make this work.

What is the regex pattern that will allow me to parse a directory into separate directories?

For example, if I have "c:\dir1\dir2\dir3\dir4" what is the regex pattern so that it will return a match of "dir1", "dir2", etc?

I'm doing this in C#, but I believe dotnet's version of Regex is the same as in any other platform.

Not sure
Tuesday, February 10, 2004

Here's how I parse paths:
1. Normalize the slash direction.
2. See if the path begins with a double slash. Remove
    the first slash if so. Remember it if you care.
3. Collapse all multiple slashes to a single slash.
4. Remove a trailing slash.
5. See if the path began with a drive directive.
    Strip it off if so. Remember it if you care.
6. Split on the slash into an array.
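
The six steps above could be sketched in C# roughly as follows. The class name PathSplitter, the method SplitPath, and its out parameters are hypothetical names made up for illustration, and treating a leading double slash as a UNC-style marker is an assumption, not something the steps state.

```csharp
using System;
using System.Text.RegularExpressions;

static class PathSplitter
{
    // Hypothetical helper that follows the six steps in order.
    public static string[] SplitPath(string path, out bool hadDoubleSlash, out string drive)
    {
        // 1. Normalize the slash direction.
        string p = path.Replace('\\', '/');

        // 2. If the path begins with a double slash, drop one slash and remember it.
        hadDoubleSlash = p.StartsWith("//", StringComparison.Ordinal);
        if (hadDoubleSlash) p = p.Substring(1);

        // 3. Collapse runs of slashes into a single slash.
        p = Regex.Replace(p, "/+", "/");

        // 4. Remove a trailing slash.
        if (p.Length > 1 && p.EndsWith("/", StringComparison.Ordinal))
            p = p.Substring(0, p.Length - 1);

        // 5. Strip a drive directive such as "c:" and remember it.
        drive = null;
        if (p.Length >= 2 && char.IsLetter(p[0]) && p[1] == ':')
        {
            drive = p.Substring(0, 2);
            p = p.Substring(2);
        }

        // 6. Split on the slash; the empty entry left by a leading slash is dropped.
        return p.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Called with @"c:\dir1\\dir2\dir3\dir4\", this sketch returns dir1, dir2, dir3, dir4 and reports "c:" as the drive.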

son of parnas
Tuesday, February 10, 2004

Any reason you don't just use the split method with the \ as a delimiter and break it up that way?
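
A minimal sketch of that suggestion, assuming the path uses only backslashes (SplitDemo and SplitOnBackslash are hypothetical names):

```csharp
using System;

static class SplitDemo
{
    public static string[] SplitOnBackslash(string path)
    {
        // Split on the backslash; the drive ("c:") comes back as the first element.
        return path.Split('\\');
    }
}
```

For @"c:\dir1\dir2\dir3\dir4" this gives "c:", "dir1", "dir2", "dir3", "dir4". A trailing backslash would leave an empty string as the last element.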

Mike Treit
Tuesday, February 10, 2004

This is the problem with regular expressions - they're sold as being *the* tool for string matching, but it turns out that they can't actually match some very common patterns. For example, a regex can't be used to match arbitrarily nested parentheses.

Unfortunately the explanation why requires that nasty Computability & Automata class that most CS students dread. :-)
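
The short version is that matching nested parentheses needs a counter that can grow without bound, and a finite automaton has no such counter. A minimal sketch of what the counter looks like in ordinary code (ParenCheck and IsBalanced are hypothetical names):

```csharp
static class ParenCheck
{
    // Returns true when every '(' has a matching ')' later in the text.
    public static bool IsBalanced(string text)
    {
        int depth = 0; // current nesting depth; this is the state a plain regex cannot keep
        foreach (char c in text)
        {
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0) return false; // a ')' with no '(' before it
            }
        }
        return depth == 0;
    }
}
```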

I think String.Split is the much easier way to go here; forget the regex.

Chris Tavares
Tuesday, February 10, 2004

"Any reason you don't just use the split method with the \ as a delimiter and break it up that way? "


Man...some days you wonder how you manage to get yourself dressed. I can't believe I didn't think of using Split.

Not sure
Tuesday, February 10, 2004

I don't know the .NET syntax, but in Perl this works:

$some = 'c:\dir1\dir2\dir3\dir4';
if ($some =~ /[a-z]:\\(.*?)\\.*/)  {
    print $1;    # prints "dir1" (only the first directory)
}

Parentheses tell it "grab this".
'?' tells it to be non-greedy.

You might need to do something weird with backslashes like \\\\ maybe not.
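
On the .NET side, a rough sketch that returns every directory name instead of only the first might look like this. RegexSegments and GetSegments are hypothetical names, and the lookbehind pattern is one possible choice, not the only one. Inside a C# verbatim string (@"...") the backslash is doubled only because the regex itself needs it escaped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexSegments
{
    public static List<string> GetSegments(string path)
    {
        var segments = new List<string>();
        // (?<=\\) requires a backslash immediately before the match;
        // [^\\]+ then takes everything up to the next backslash.
        foreach (Match m in Regex.Matches(path, @"(?<=\\)[^\\]+"))
        {
            segments.Add(m.Value);
        }
        return segments;
    }
}
```

For @"c:\dir1\dir2\dir3\dir4" the list holds dir1, dir2, dir3, dir4; the drive is skipped because nothing precedes it.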

Tuesday, February 10, 2004

Seems like "c:\dir1\dir2\dir3\myfile.txt" parses into

  <some number of "xxxx\">
  <the filespec>

so "[A-Za-z]:\\([^\\]+\\)*.*" would seem to do it, no?
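
In .NET, a repeated group keeps every capture it made, so a pattern of this shape can hand back each directory. A sketch under that assumption, with the pattern above adjusted so the captured name excludes its trailing backslash (RegexCaptures and GetDirectories are hypothetical names):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexCaptures
{
    public static List<string> GetDirectories(string path)
    {
        var names = new List<string>();
        // Drive, then zero or more "name\" pieces, then the filespec.
        Match m = Regex.Match(path, @"^[A-Za-z]:\\(?:([^\\]+)\\)*(.*)$");
        if (!m.Success) return names;

        // Group 1 matched once per directory; Captures holds all of those matches.
        foreach (Capture c in m.Groups[1].Captures)
        {
            names.Add(c.Value);
        }
        return names;
    }
}
```

For @"c:\dir1\dir2\dir3\myfile.txt" this yields dir1, dir2, dir3. The last element of a path is always treated as the filespec, so a path ending in "dir4" with no trailing backslash would not list dir4.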

Tuesday, February 10, 2004

I don't think that regular expressions are oversold. Everybody knows their limitations. If you need nesting, use a parser derived from a general context-free grammar. Regular expressions don't support nesting because, by definition, a regular grammar's production rules allow at most one nonterminal, and only at one end of the right-hand side, so there is no way to track an unbounded nesting depth.

The expression for extracting elements of a path, using boost regex syntax, looks something like:


There's probably a better expression to restrict characters of the path to whatever character class the OS defines for file/path names (rather than '.').  The above expression ignores the actual name of the file.

Tuesday, February 10, 2004

One other thought, how about using some of the directory methods in the Framework library?  If you just need to parse the names, there might be an appropriate method in the System.IO.Path class.

Otherwise, you could create an instance of the DirectoryInfo class for your path name, and then recursively traverse the parent folders.  I.e., create a DirectoryInfo for "c:\dir1\dir2\dir3\dir4", then get that directory's parent ("c:\dir1\dir2\dir3"), then get that directory's parent, etc.

Either option might be more robust than rolling-your-own parsing routine.

Robert Jacobson
Tuesday, February 10, 2004

Here's a quick sample that uses System.IO.Path to recursively get each subfolder's name:

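using System;
using System.IO;
using System.Diagnostics;

// Assumes Windows path rules (backslash separator, drive letter).
string parent = @"c:\dir1\dir2\dir3\dir4";
string folder = Path.GetFileName(parent);

// Walk up one level at a time until only the root ("c:\") is left.
while (folder != null && folder.Length > 0)
{
    Debug.WriteLine(folder);  // dir4, then dir3, dir2, dir1
    parent = Path.GetDirectoryName(parent);
    folder = Path.GetFileName(parent);
}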

"GetFileName" actually returns the name of the last folder if the path ends in a folder instead of a file, and "GetDirectoryName" returns the path of its parent.

Robert Jacobson
Tuesday, February 10, 2004

I prefer the string.Split method, if only because I can convince myself at a glance that the code will work. A regex (that you haven't written yourself) is a pain to read and figure out.

If your code is going to be read by someone else, do them a favour and avoid complicating it with a regex for something that can just as easily be done another way.

Sum Dum Gai
Tuesday, February 10, 2004

Do it the simplest way.

If it's 3 lines with a regular expression or 15 lines with some other method, use the regular expression.  In this case, 12 lines isn't a whole lot to pay for what some people think is "added clarity," but that's not the actual cost.  The actual cost is 12 times the number of times over the course of your career that you'll have to make this compromise for other people.

Make your goal to implement each project in the smallest amount of code that you can (shifting common abstractions into core libraries as necessary), and your life will be much more pleasant.

Wednesday, February 11, 2004

Why regex? Why don't you just split the string on "\\"?

Wednesday, February 11, 2004

It's a frigging one liner without the regex, how much trouble is it to do that for clarity?

string[] parts = dirString.Replace('\\', '/').Split('/');

When I look at that, I can see instantly what's going on (normalise direction of slashes and split on them). None of the regexes listed gives me the immediate comprehension I get from the code example. That's not because I don't know regex (I use it myself sometimes), but simply because regexes are more difficult to read.

Even if it was 15 lines vs 3, the 15 line version may win out. Some lines are simple and some complicated. If it takes you longer to work out what the 3 line version does than the 15 line version, then the 15 line version is actually simpler. LOC is a crap measure of complexity (go take a look at some obfuscated C code entrants - all lines are not equal).

Sum Dum Gai
Wednesday, February 11, 2004
