Fog Creek Software
Discussion Board

Unicode

Hi,

I have recently been learning about Unicode and have read various articles on the topic (including one from Joel). I would like some clarification on the use of the various encoding schemes on Windows.

I have made some small tests and seem to conclude that:

1. UTF-32 is not supported on Windows.
2. UTF-16 is what _UNICODE refers to in the Win32 API.
3. UTF-16 doesn't work at all on HTML files.
4. UTF-8 works well in HTML files for 1-byte, 2-byte and 3-byte character sequences, but not for 4-byte sequences (need some extra fonts?)
5. UTF-8 doesn't work with the Win32 API (_mbsinc(), CharNextExA(), etc.) apart from conversion using WideCharToMultiByte with the CP_UTF8 code page.

My questions are:

How well is UTF-16 supported on Windows? Does it support 4-bytes long character sequences?

Isn't UTF-8 really supported on Win32? What is a multi-byte character string exactly?

Cheers.

Maxime Labelle
Thursday, July 29, 2004

1) UTF-32 is not widely supported at all, since it requires 4 bytes per character. UTF-16 is probably the best compromise right now unless you really need a lot of characters outside the base plane.

2) UCS-2 is what _UNICODE gives you. It is subtly different from UTF-16, as UCS-2 does not let you get out of the base plane.

3) HTML files are 8-bit, so no, UTF-16 doesn't work for HTML.

4) UTF-8 support for 4-6 byte characters works just fine. If you don't have the fonts to represent the characters, that's not HTML's fault.

5) Don't use ANSI Win32 string functions for UTF-8. First convert back to wide characters before doing string manipulation.
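
A minimal sketch of that convert-first pattern, assuming a NUL-terminated UTF-8 input (the helper names are just for illustration):

    #include <windows.h>
    #include <stdlib.h>

    /* UTF-8 -> wide (UCS-2/UTF-16). Caller frees the result with free(). */
    wchar_t *Utf8ToWide(const char *utf8)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (len == 0)
            return NULL;
        wchar_t *wide = (wchar_t *)malloc(len * sizeof(wchar_t));
        if (wide != NULL)
            MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
        return wide;
    }

    /* ...do the string manipulation on the wide string, then convert back. */
    char *WideToUtf8(const wchar_t *wide)
    {
        int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
        if (len == 0)
            return NULL;
        char *utf8 = (char *)malloc(len);
        if (utf8 != NULL)
            WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
        return utf8;
    }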

Windows natively supports UCS-2, but .Net goes whole hog with UTF-8. In .Net, 4-byte character sequences are only supported by reading the string as a byte array and using the System.Text functions to convert to 2-byte sequences for UTF-16.

Multi-byte is not UTF-8. UTF-8 is a particular way of representing all the Unicode characters using anywhere from 1 to 6 bytes. Multi-byte is a different way of representing characters, using 1 or 2 bytes, and it does not cover all of Unicode.

UTF-8 is not "natively" supported in Windows. It's just a way to interchange documents with systems that only support 8-bit characters (like HTML).

Ankur
Thursday, July 29, 2004

UNICODE in Win32 isn't UTF-16; it's UCS-2. UCS-2 is always 2 bytes per code point, and you cannot encode characters beyond the Basic Multilingual Plane (BMP).

In the early days, 2 bytes was enough to encode all the code points, which is why Windows uses it.

However, when it was determined that 2 bytes did not provide enough values for all possible code points, UTF-16 and UTF-8 were created (along with UCS-4, which uses 4 bytes per code point). It should be noted that this was done AFTER Unicode support was added to the Windows API.

So to answer your questions:
"How well is UTF-16 supported on Windows?"

Not at all.  It supports UCS-2.

"Does it support 4-bytes long character sequences?"

Doubtful based on the above.

"Isn't UTF-8 really supported on Win32?"

Nope.  UTF-8 is rarely used internally by any software or operating system.  It's generally converted to UCS-2 and used that way.

"What is a multi-byte character string exactly?"

In Windows, it's a UCS-2 string.

Almost Anonymous
Thursday, July 29, 2004

Actually, a multi-byte character string in Windows is NOT a UCS-2 string. That's what's called "wide character string."

Multibyte character strings use encodings like SHIFT-JIS or UTF-8, where a single "character" can be represented by a variable number of bytes, and you can't do random access to a particular character, but must instead traverse character by character from the beginning.

With a UCS-2 string, every character is exactly 2 bytes, so random access is possible.
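
To make the traversal point concrete, a rough sketch, assuming a NUL-terminated string in a DBCS code page such as Shift-JIS (code page 932); the function name is made up:

    #include <windows.h>

    /* Count the "characters" in a multibyte (DBCS) string: you have to walk
       it from the start, because you can't tell from an arbitrary byte
       whether it begins a 1-byte or a 2-byte character. */
    int CountDbcsChars(const char *s, WORD codePage)  /* e.g. 932 = Shift-JIS */
    {
        int count = 0;
        while (*s != '\0') {
            s = CharNextExA(codePage, s, 0);  /* advances 1 or 2 bytes */
            count++;
        }
        return count;
    }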

In general, on NT-class OSes you work with UCS-2 strings; on Win9x, you're stuck with multibyte and whatever the current code page is.

Chris Tavares
Thursday, July 29, 2004

"Actually, a multi-byte character string in Windows is NOT a UCS-2 string. That's what's called "wide character string.""

Oh yes..  I got multi-byte and wide character mixed up.  I actually used to do multi-byte work (translating English to Chinese in a VB application) on Windows 98.  Sooo ugly.  ;)

Almost Anonymous
Thursday, July 29, 2004

Ah, ah!

Thanks a lot. That is much clearer now.

Maxime Labelle
Friday, July 30, 2004

How about Java? I know that a char in Java is 2 bytes long.
Is it UCS-2, UTF-16, or something else?

Olivier
Friday, July 30, 2004

Java supports the Unicode 3.0 character set (that is, the BMP: the characters that fit in '\uxxxx').
I suppose the internal encoding method is implementation-specific (well, let's suppose they just use two bytes; I think we can call it UCS-2 -- this is not UTF-16, because UTF-16 includes characters outside the BMP).

This should change in Java 1.5. In case you're interested, read http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

Pakter
Friday, July 30, 2004

"Nope.  UTF-8 is rarely used internally by any software or operating system.  It's generally converted to UCS-2 and used that way."

Quite the reverse: what you've described really holds true only on one platform from one manufacturer.

UTF-8 is the recommended encoding for all Unix systems, and is the de jure standard (or "recommendation") for most W3C languages and protocols and for IETF protocols. It's also the de facto standard pretty much everywhere else, except on Windows (on BeOS, for example, it's hardly mentioned, but actually all the APIs are UTF-8 only).

Popular implementations like the GNU i18n and l10n libraries use UTF-8 almost throughout, with just a few places converting to UCS-4. The recommendation from i18n and Unicode experts, especially on Linux systems, is to stay with UTF-8 throughout your program. It would be pointless to convert to UCS-2, since that's obsolete and now found more or less exclusively on legacy Windows systems.

There are several cute features of UTF-8 which make it much easier and faster to code for than UTF-16, and of course it's much more compact than UCS-4. For example, counting the number of characters in a NUL-terminated UTF-8 string just requires a modification to the per-byte comparison (skip the continuation bytes, which look like 10xxxxxxb) compared to a similar count in ASCII. UTF-8 also doesn't need crazy 1960's-style begin-document markers (byte order marks) because it's endian-neutral.
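
A quick sketch of the counting trick mentioned above, assuming valid NUL-terminated UTF-8 (the function name is made up):

    #include <stddef.h>

    /* Count code points in a NUL-terminated UTF-8 string by counting
       every byte that is NOT a continuation byte (10xxxxxx). */
    size_t utf8_strlen(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;
        }
        return count;
    }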

An earlier poster pointed out that the decision to use UCS-2 in Windows predates the use of Plane 1, so all the APIs Microsoft changed for it are now obsolete and will eventually need a further redesign. It would be sensible to choose UTF-8 for this redesign, but politically it's probably going to have to be UTF-16, despite its shortcomings.

Nick Lamb
Sunday, August 01, 2004

Windows (2000 and later) operating systems do support, to some degree, the full Unicode character set, via UCS-2 surrogates. Q.v. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp
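
For what it's worth, the surrogate mechanism is just arithmetic. A rough sketch, using U+1D11E (a character outside the BMP) purely as an example:

    #include <stdio.h>

    /* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
    void to_surrogate_pair(unsigned long cp, unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000;
        *hi = (unsigned short)(0xD800 + (cp >> 10));    /* high (lead) surrogate  */
        *lo = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate  */
    }

    int main(void)
    {
        unsigned short hi, lo;
        to_surrogate_pair(0x1D11E, &hi, &lo);      /* MUSICAL SYMBOL G CLEF */
        printf("U+1D11E -> %04X %04X\n", hi, lo);  /* prints D834 DD1E */
        return 0;
    }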

Tony Kimball
Friday, August 13, 2004
