Fog Creek Software
Discussion Board

PHP, Unicode .. other solutions!

PHP has problems other than Unicode. Many of the PHP scripts you can download everywhere have very common security issues, mostly from either not sanitizing data or failing to initialize variables properly. Plus it scales into ugly beasts of programs. Just look at whatever you find on SourceForge.

They say PHP5 will solve all this. To me it's just vaporware, and no one knows yet if it'll improve things. There are plenty of other good web solutions.

Perl was the web king before PHP and ASP, and it still offers some tremendous improvements over PHP. It'll run your scripts much faster (which translates to more simultaneous users), it has a taint mode for data, which is a great help, and it has several mature web templating systems. Plus it handles Unicode well.

Please check out Mason for very straightforward templating, rather PHP-like. For something with a somewhat steeper learning curve but which scales very beautifully for larger codebases, look at AxKit. PHP was intended for Personal Home Pages, and it'll take yet a couple of years before it matures beyond that architecturally. I'm sure it will, however, since Yahoo puts resources into it. But for now, I'll stick to Perl.

Jonas B.
Saturday, October 11, 2003

Am I the only person who read Joel's message,

"When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough."

and thought, "Wow, Joel's going to fix PHP!"?

Tom (a programmer)
Saturday, October 11, 2003

Ummm, very likely you were.

Simon Lucy
Saturday, October 11, 2003

Especially since ASP.NET is built on .NET, whose strings are all internally UTF-16. :)

Brad Wilson
Saturday, October 11, 2003

I am a non-English speaker.

Frankly, I believe that everybody should switch to English.

Unicode is a bloody horror! :-(

Saturday, October 11, 2003

Code pages are a bloody horror. Unicode is tolerable.

Brad Wilson
Saturday, October 11, 2003

Separate your content from your html!

Store your content somewhere as (UTF-8 / UTF-16 / favoured encoding) and then *translate* it to whatever ugly glob of gunk it needs to be for (Browser-X) to see it.
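
A minimal sketch of that idea, written in Python purely for illustration (the sample text and the `render_for` helper are assumptions, not anything from the thread): the content is held as Unicode text, and the byte encoding is chosen only at output time. Characters the target charset cannot represent are emitted as numeric character references, which HTML renderers understand.

```python
# Content is stored once as Unicode text; no output encoding is baked in.
STORED_CONTENT = "Grüße, ça coûte 5 €"


def render_for(content: str, charset: str) -> bytes:
    """Encode content for a client that expects `charset`.

    Anything the charset cannot represent becomes a numeric character
    reference such as &#8364; instead of raising an error.
    """
    return content.encode(charset, errors="xmlcharrefreplace")


utf8_page = render_for(STORED_CONTENT, "utf-8")
latin1_page = render_for(STORED_CONTENT, "iso-8859-1")
ascii_page = render_for(STORED_CONTENT, "ascii")
```

The same stored string serves all three clients; only the last translation step differs.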

HTML is just a format for rendering out stuff to a bunch of incompatible inconsistent rendering engines.

I am sick of people writing it, talking about it, and viewing it as some kind of language!

Imagine if discussions went on like this about EPS or .TEX files.

Let the machines make the HTML and let's just manage the content, because the content is really what matters.

Damien Connolly
Sunday, October 12, 2003

Frankly, I'm not starting any new work using anything but UTF-8. It just makes the whole mess so much easier to handle. For example, sanitizing a string with regular expressions and POSIX character classes is a snap: [:alpha:] expands to letters whether it's A, ß or ç. And everything else just sort of works, so long as all your functions/objects are UTF-8 aware.
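
A rough illustration of letter-based sanitizing, in Python rather than the poster's own tools. Python's `re` module has no POSIX bracket classes, so this sketch swaps in `[^\W\d_]` (word characters minus digits and underscore) as an approximation of "any Unicode letter". The helper names are hypothetical.

```python
import re

# Stand-in for [:alpha:]: Unicode word characters, minus digits and "_".
LETTERS = re.compile(r"[^\W\d_]+")


def is_all_letters(text: str) -> bool:
    """True if text is non-empty and consists only of letters."""
    return LETTERS.fullmatch(text) is not None


def keep_letters(text: str) -> str:
    """Drop every character that is not a letter."""
    return "".join(LETTERS.findall(text))
```

With this, `is_all_letters("Straße")` and `is_all_letters("façade")` both hold, while `"abc123"` is rejected, without any per-language special cases.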

By the way, for those stuck with a lot of legacy PHP code (like me ;-() the mbstring module allows you to overload a lot of the text handling functions from regular expressions to strlen(). You still have to deal with nasty bits of code, but you won't get screwed by a forgotten strlen() not converted to a multi-byte aware version.
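
To see why a single byte-oriented strlen() left in the code can hurt, here is a small Python sketch (the sample string and function names are assumptions for illustration). It contrasts counting characters with counting UTF-8 bytes, which is roughly the difference between PHP's `mb_strlen()` and a plain `strlen()` on UTF-8 data.

```python
text = "Straße"

char_count = len(text)                  # counts characters (code points)
byte_count = len(text.encode("utf-8"))  # what a byte-oriented length sees


def truncate_chars(s: str, limit: int) -> str:
    """Cut on character boundaries; the result is always valid text."""
    return s[:limit]


def truncate_bytes_naive(s: str, limit: int) -> bytes:
    """Cut on byte boundaries, as byte-oriented code would.

    This can split a multi-byte character and leave invalid UTF-8.
    """
    return s.encode("utf-8")[:limit]
```

Here the character count is 6 but the byte count is 7, and cutting at 5 bytes slices "ß" in half, leaving bytes that no longer decode as UTF-8.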

Sunday, October 12, 2003

The PHP internationalization nightmares had been reported on JoS even before Joel ever mentioned an interest in PHP. I still believe there are few jobs for which PHP is the right tool (maybe only the "I had to do this very thing before, wrote it in C and added it to the base PHP library" kind).

Just me (Sir to you)
Monday, October 13, 2003

Jonas B.: Yeah, when I was thinking of using PHP, the first thing I had to do was find the dozen needles in the haystack (Google) explaining exactly how to sanitize data properly.

From the looks of it, you have to write a thin library that does four things:

1. Let you declare that you only want to read environment variables from certain sources, like GET, POST, or cookies.

2. Return the data from the right source based on your declaration.

3. Let you declare that you want to apply certain rules to the data.

4. Clean up the data based on your requirements.

You can have functions like:

(Never assign the values returned by getexpect to temporary variables. The whole point is that you are worried that cookies and form vars can maliciously overwrite any potential temporary variable.)
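
The four steps above could be sketched roughly as follows. This is Python, not PHP, and not the poster's code: only the name getexpect comes from the post, while `expect`, the rule names, the `REQUEST` dictionary and its sample values are all assumptions made up for illustration.

```python
import html
import re

# Hypothetical request data standing in for PHP's GET / POST / cookie arrays.
REQUEST = {
    "get": {"page": "12", "user": "admin"},
    "post": {"user": "li-fan", "comment": "hi <b>there</b>"},
    "cookie": {"user": "evil", "session": "abc123"},
}

# Step 3: named cleaning rules. Each returns the cleaned value, or None
# when the raw value does not satisfy the rule.
RULES = {
    "int": lambda v: v if re.fullmatch(r"[0-9]+", v) else None,
    "word": lambda v: v if re.fullmatch(r"[A-Za-z0-9_-]+", v) else None,
    "text": lambda v: html.escape(v),
}

_declared = {}  # variable name -> (source, rule)


def expect(name, source, rule):
    """Steps 1 and 3: declare the only allowed source and the rule to apply."""
    if source not in REQUEST:
        raise ValueError(f"unknown source: {source}")
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule}")
    _declared[name] = (source, rule)


def getexpect(name):
    """Steps 2 and 4: read from the declared source only, then clean."""
    source, rule = _declared[name]  # KeyError if the name was never declared
    raw = REQUEST[source].get(name)
    return None if raw is None else RULES[rule](raw)


expect("user", "post", "word")
expect("page", "get", "int")
expect("comment", "post", "text")
```

In this sketch `getexpect("user")` returns the POST value even though GET and the cookie carry a `user` of their own, and asking for a name that was never declared fails instead of silently reading something.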

With explicit settings in PHP4, you should be able to restrict implicit variable creation too...

Someone correct me if I am totally off on this. Thanks.

Li-fan Chen
Saturday, October 18, 2003

