Fog Creek Software
Discussion Board




Unicode and Character Sets article

The idea that Unicode characters fit into 2 bytes is not a myth. That makes it sound like some silly people leapt to that conclusion without a good reason. But:

"Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is required to specify any character in any language."

That's from the introduction to the Unicode 1.0 spec.

This survives today in the padded two-byte hex style that Joel uses in his article: U+FFFF.

He also says "There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory."

Well, apart from those Texans who use Linux or Solaris! As far as George W. Bush is concerned, (sizeof(wchar_t) == 4).

But there is a serious side to all this - Joel seems to conclude that if you use wchar_t then everything is hunky dory. How can it be, if your OS (and mine) only allows two bytes for that data type?

The horrifying truth is that wide strings on Windows 2000 and later do not contain UCS-2. They contain UTF-16. Yes, there is a difference! UTF-16 is like UTF-8 - a single character may take up more than one wchar_t. So we're back where we started in terms of simple string handling. Unicode is unfortunately a victim of gradually changing requirements and has been extended in a way that makes it harder to write correct software.
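
To make the UCS-2/UTF-16 difference concrete, here is a minimal sketch in C of how UTF-16 splits a code point above U+FFFF into a surrogate pair. The function name utf16_encode is made up for illustration, and uint16_t stands in for a 16-bit wchar_t so that the example behaves the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);              /* upper limit of Unicode */
    assert(cp < 0xD800 || cp > 0xDFFF);  /* surrogate values are not characters */

    if (cp < 0x10000) {
        /* Fits in one unit; identical to UCS-2. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Above U+FFFF: spread the remaining 20 bits over two units. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}
```

For example, U+1D11E (musical G clef) comes out as the two units D834 DD1E, while anything at or below U+FFFF still takes a single unit.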

For example, wcslen: does it return the number of characters or the number of wchar_t elements? So how long do you need to make your array? And how about on NT 4.0?
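
For what it's worth, wcslen counts wchar_t elements up to the terminating zero, not characters, so it is the right number for sizing an array but the wrong one for counting characters. A rough sketch of the two counts, assuming well-formed UTF-16 and using uint16_t in place of a 16-bit wchar_t (both function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit code units before the terminating 0.
   This is what wcslen reports when wchar_t is 16 bits wide. */
size_t utf16_units(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points, assuming well-formed input:
   count every unit except low (trailing) surrogates. */
size_t utf16_code_points(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            n++;
    }
    return n;
}
```

For the string "A" followed by U+1D11E, stored as 0041 D834 DD1E, the first function gives 3 and the second gives 2.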

And yes, there is a wide version of AnsiNext, called CharNextW.

Daniel Earwicker
Sunday, November 09, 2003

One thing I don't understand is why we need UCS-4 if UCS-2 already encapsulates all of Unicode (as Unicode codes are written U+xxxx)?

Almonimus
Friday, November 14, 2003

Is it possible to ask you about this? It seems like you know the Unicode tricks ;) and I'm stuck:

I don't understand why this does not work:

Printer.FontName = "@Arial Unicode MS"
Printer.Print "Hello " + ChrW(601)
Printer.EndDoc

This only results in a "Hello ?". Arialuni.ttf is correctly installed, and ChrW(601) (U+0259: Latin Small Letter Schwa) prints OK from MS Word on the same computer. The font is also displayed correctly on screen in the fm20.dll textbox. So I thought (hoped) that it would be just as straightforward to print Unicode letters on paper, but apparently not?

Terje Dahl
Friday, November 14, 2003

Terje,
What programming language are you using? What platform? Is your printer capable of printing that font from that language? (I'd suggest downloading some software built with the same language and testing with that; the fact that it prints from MS Word does not tell you anything.)

Almonimus
Friday, November 14, 2003

Many months too late, but the answer to why you need 4 bytes is that Unicode is not 16-bit.

Unicode 1.0 was 16-bit. Then, for example, they decided to add ALL 150,000 Chinese characters from all of history instead of just the 5000 in common use. 150,000 characters do not fit in 16 bits, which allow only 65,536 values. Hence Unicode 2.0 and all these issues to deal with that fact.

The point of adding all those, and other ancient character sets like Egyptian, is to make it possible to process all those characters as text. Before they were added there was no standard way to represent them.

If we make it to outer space and find a million civilizations then maybe 4 bytes will not be enough, but for now it will do for all earthly purposes. 2 bytes was not enough, though.

Gregg Tavares
Tuesday, June 01, 2004
